ABSTRACT

The Lexical Translator is a program written in Turbo PASCAL
that generates Latin PASCAL source code from Arabic PASCAL
source code. The Arabic code is written under a bilingual
operating system that is transparent to DOS on personal
computers.

The compatibility of the bilingual operating system, as well
as the code values of the Arabic characters, is investigated.
The Latin code is then fed into a computer to be compiled and
run with a Latin compiler (i.e., Turbo PASCAL) in an Arabic
environment.
I.   INTRODUCTION

English is the most widely used scientific language today.
Its alphabet descended from the Latin alphabet and is familiar
to people in Europe and in every country whose language uses
a Latin-derived alphabet. The various alphabets descended
from Latin differ only slightly from one another.

The wide use of Latin alphabets has made it easy to set
standards for typewriters and console keyboards. The features
most of them share (similar grammar, similar fonts, and a
left-to-right direction of flow) have made standardization
easy.

Many of the computer pioneers made an effort not to limit
their software to one spoken language. Even so, software is
what limits the use of computers in any given language.
Typically, programmers' lack of knowledge of a foreign
language limits their ability to write applications that are
acceptable to its users. Few nations are blessed with computer
development technology. However, all nations have people who,
as users, are capable of contributing to humanity using this
technology.


Given the technology existing today, if we can create an
interface between a host foreign language and a target
application language, there will be fewer barriers for nations
that do not use standard English-, French-, or German-based
computer operating systems and software. The interface will
accept user commands from the host environment and translate
them to the syntax of the target environment. It is assumed
that the user knows the semantics of the target environment in
the terms of his or her own spoken language.

The question may be asked, "What good will this approach do
such a nation?" There are several benefits. The two most
important are these: first, a good library of software already
exists; second, existing software (even with the addition of
an interface communicator) costs less than newly written
customized software. It is faster and easier to write an
interface than to rewrite a large body of software.

Two user environments should not be confused. The first is
the mainframe environment. The customized foreign alphabets
used in many countries on mainframes for specific applications
are developed by contractors who are expert in the application
but not necessarily in the foreign language. The mainframes
must use the software provided by the original contractors,
and it takes a lot of effort and capital to develop a new
software application for the special machine. This limits the
use of the computer to operators and data entry personnel,
with minimal creative programming from the user side. Users do
not share the expertise of others or benefit from continuously
improving software, because there are few users and little
feedback to software developers.

The second user environment is that of the average user who
has some scientific background but has neither access to
mainframe hardware nor the capital to invest in it. This user
is often an educator, a student, or a professional. This
category of users has great potential. Software with a native
language interface would be both very helpful and affordable
for this group, which is very capable of contributing in its
respective fields with the powerful processing features of
today's personal computer technology.

This thesis is concerned with the second user environment
for several reasons. The second group of users are the
creative ones. Their understanding of computers and their
applications is a major step toward building the target
machine with compatible native standards. This will eliminate
the ad hoc design by a contractor who, most of the time, has
to hire a non-technical translator and dictate to that person
the language specification, key words, and commands of the
operating system or query language. Usually the translator
renders the machine's native-language key words in the target
language using its alphabet, and may have minimal programming
or computer experience. This will most likely lead to an
ambiguous environment for users to work with.

The feasibility of such an approach is constrained by
several factors. One is the language, or user environment:
how is the language implemented or emulated on standard
Latin-language hardware? Another is the compatibility of the
target machine (i.e., micro to mini computers) with others in
the same family. Economic feasibility is based on supply and
demand; a developer must weigh the benefit against the
development cost before developing such interface software.

The Arabic language is very rich in vocabulary and
historical background. The Arabic alphabet is very old. The
language was used for several centuries by leading ancient
mathematicians, physicians, biologists, and chemists, who
contributed successfully to their fields using the Arabic
alphabet. Their numerals, symbols, and equations were all
written in Arabic. However, this does not make it simple to
use the Arabic alphabet in the modern computer environment.

One reason is that the direction of flow in reading and
writing is from right to left. Secondly, Arabic characters
are not printed like Latin characters; Arabic words are
printed like calligraphy. Arabic characters must be written
in either stand-alone or connected form. A character may
occupy one of three positions: at the beginning of a word, in
the middle, or at the end. A set of complicated rules
determines the shape of a character from its position within
the word. This difficulty has complicated attempts to provide
a software emulation of the Arabic environment on personal
computers.

The goal of this thesis is to provide an approach to
solving this problem. The steps that must be followed are
described, along with special considerations. To show that
translation is possible, we will develop an interface between
an Arabic form of source code in the PASCAL language and an
existing English PASCAL compiler. The interface will take
sample source code written in Arabic and lexically translate
it to English source code. The goal is that, given correct
Arabic source code, the interface will produce correct English
source code. This needs to be done only once: once the program
is compiled, the interface step is no longer needed.
II.   BACKGROUND  ON  ARABIC  CHARACTER

A.   INTRODUCTION 

There are 28 basic characters in the Arabic alphabet
(Figure 1). However, these basic characters are not
sufficient for use with computers or typewriters.
Authorities agree [Ref. 1] that the optimum set for
representing Arabic texts needs a minimum of 31 characters
(Figure 2), three more than the original set. One may check
the Kufic script, which is over 1500 years old, to see that
engravings by ancient Arabs were done with close to 31
characters, each having one shape. Over the years, variations
of the characters have developed for ease of writing and
reading. Each character may now have from two to five shapes
depending on its location within a word. All applications must
use these variations as standards to represent Arabic texts.
Implementing the variations is critical for compatibility, and
the code representation of any variation must follow a strict
standard to ensure survival among other implementations.

The Arabic alphabet has only three vowels among its 28
characters (see Figure 3 for the alphabet names).
Vowelization is also performed through the use of diacritics
(see Figure 4). Most Arabic texts do not show diacritics;
readers have learned to read and understand a word based on
the context of its use. Where misinterpretation would be
critical, clarification is provided in parentheses. Most
applications today do not require diacritic symbols.

Both Arabic and Hindu numerals are used in the Arabic
world. "Arabic" is the name given to the numerals used with
the Latin alphabet; North African countries use these
numerals, while most of the Arabic world uses the Hindu
numerals (Figure 5). However, history books show that both
systems originated in India. Arabic uses the Latin comma as
the decimal point, to distinguish it from the zero used in
the Arabic world, which is written like the Latin decimal
point ".".

B.   ARABIC  LANGUAGE 

The  Arabic  language  differs  from  languages  descended 
from  Latin  in  several  ways.   The  primary  differences  are: 

*  Arabic  is  written  right  to  left  instead  of  left  to 
right. 

*  Vowels are represented by diacritics, in the form of marks
above or below most letters within a word.

Secondary  differences  are: 

*  Letters  in  Arabic  may  be  joined  or  not  according  to 
location  within  the  word.  A  particular  letter  may  be 
joined  to  the  preceding  letter,  and/or  following  letter. 

*  Each  letter  has  between  two  and  five  different  forms 
dependent  on  its  contextual  position. 

Lexically, the Arabic language can be defined in BNF
notation as follows [Ref. 1: p. 28]:

<language>   =  { <sentence> }*
<sentence>   =  { <word> }*
<word>       =  { <character><voc. sym><character> }*
<character>  =  { see Figure 1 }*
<voc. sym>   =  { see Figure 4 }*


C.   WRITING  ARABIC 

Writing in Arabic flows from right to left. Each additional
line also runs from right to left, beginning below the
previous line. A word is entered by typing the first character
at the cursor position, followed (to the left) by the next
character. Take the word "hello" as an example. If it were
entered the way Arabic is, it would be entered as follows:

cursor position                         _<
step 1.   enter character "h"          _h<
step 2.   enter character "e"         _eh<
step 3.   enter character "l"        _leh<
step 4.   enter character "l"       _lleh<
step 5.   enter character "o"      _olleh<

This demonstrates the direction of flow. However, if one
must also worry about the shape of each character, entry
becomes tedious for long text. In some applications one must
provide diacritics as well. In short, typing one vocalized
word can seem like a puzzle.
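The entry steps above can be sketched in a few lines of Python. This is only an illustration of the direction of flow, not part of the translator: the function name `rtl_entry_states` is hypothetical, and the `_` and `<` markers for the cursor and the right margin are borrowed from the example.

```python
def rtl_entry_states(word, cursor="_", margin="<"):
    """Return the line as it looks after each keystroke in right-to-left entry."""
    states = [cursor + margin]      # empty line: the cursor sits at the right margin
    line = ""
    for ch in word:
        line = ch + line            # each new character lands to the left of the last one
        states.append(cursor + line + margin)
    return states


# Print the states right-aligned, so the right margin stays fixed.
for state in rtl_entry_states("hello"):
    print(state.rjust(10))
```

Character shaping and diacritics are deliberately left out; the sketch only shows that each keystroke extends the line toward the left.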

There  are  rules  governing  the  shape  (form)  of  the  letter 
based  on  its  contextual  position. 
Abdulilah Dewachi [Ref. 1: p. 27] has the following opinion
on the rules:

These  rules  have,  in  my  opinion,  been  developed  for  ease 
of  handwriting  and  have  no  bearing  on  the  semantic  and/or 
syntactic  requirement  of  the  language. 

Whatever the reason for the development of the rules, all
books, newspapers, and magazines in the Arab countries today
are written using them, and they will stay that way for years
to come.

Arabic letters are cursive in shape. An implementation of
the alphabet is judged largely by how authentic the characters
look. The cursive nature of the characters requires that both
the monitor and the graphics adapter provide good resolution.
High resolution is also required to support correct
vocalization, as previously discussed.

D.   ARABIC  NUMERALS 

Both the eastern Arabic numerals and the western Arabic
numerals (Figure 5) are used. Countries like Algeria, Morocco,
and Tunisia use the western Arabic numerals. The choice of
numeral system is not a critical issue, since both
representations have the same value.

Many people believe that the Arabs write numbers from left
to right. This is a misconception. The language books and
schools teach the classical way of writing and reading the
numerals. The classical way is to use either the words ("one",
"two", ...) or the numbers ("1", "2", ...), writing from right
to left. For example, the number 523 is written in Arabic as
"three and twenty and five hundred." It may sound wrong in
English composition, but this is the syntax that classical
books use, and it is also followed in reading the numbers.
This method should be encouraged.

The most common method of handwriting numbers is to write
the digits in the order they are said. An example of how
numbers are read and written today is the year 1986,
pronounced "one thousand nine hundred six and eighty." Notice
that the six comes before the eighty. Writing the number
"1986" using numerals is done as follows:

first digit      1
second digit     19
third digit      19_6
fourth digit     1986

This method is far too complicated to be adopted by
mechanical machines. The classical method should be encouraged
for another obvious reason: the numbers are entered least
significant digit first, in low memory. From the computer
hardware point of view, the adders/subtractors may begin work
on a number before the complete number is loaded [Ref. 1],
which is more efficient. Also, both numbers and strings will
be right justified.
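The efficiency argument for entering the least significant digit first can be illustrated with a digit-serial adder. The sketch below is an assumption-laden illustration in Python, not a description of any particular hardware: `add_lsd_first` and `to_lsd_digits` are hypothetical names, and decimal digits stand in for whatever unit the machine actually processes.

```python
from itertools import zip_longest


def to_lsd_digits(n):
    """Split a non-negative integer into decimal digits, least significant first."""
    return [int(c) for c in reversed(str(n))]


def add_lsd_first(a_digits, b_digits):
    """Add two digit streams that arrive least significant digit first.

    Each result digit is yielded as soon as its input digits have arrived,
    so the adder never has to wait for the complete number.
    """
    carry = 0
    for a, b in zip_longest(a_digits, b_digits, fillvalue=0):
        carry, digit = divmod(a + b + carry, 10)
        yield digit
    if carry:
        yield carry


result = list(add_lsd_first(to_lsd_digits(1986), to_lsd_digits(523)))
print(result)
```

Because the carry only ever moves toward more significant digits, this order lets the sum be produced in a single pass. With the most significant digit first, the adder would have to buffer the whole number before it could start.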

This chapter has outlined the major concerns and
differences between the Arabic and Latin alphabets. A few
more things are worth noticing. The opening brackets "[",
"{", and "(" are the closing brackets in Arabic, and vice
versa. The Arabic question mark looks like "?" mirrored about
its vertical axis. A complete code set, including special
characters, is listed in the ARCH code set (Appendix D). ARCH
will be discussed in detail in later chapters.
III.   CONTEXTUAL  PROBLEMS  IN  ARABIC  WORDING

For any computer to work in Arabic, it must also be able
to handle the English alphabet. Arabic users will pay a few
extra dollars to add bilingual features when purchasing a
computer. The form of the bilingual feature is a
controversial issue. This chapter will show why one should
be concerned about using mixed mode, or even alternating
between the two alphabets, Latin and Arabic.

There are three major differences between alphabets
descended from Arabic and those descended from Latin:
direction of flow, diacritics, and the position-dependent
shape of characters. These issues are specific to the
language. This chapter discusses them with respect to the
computer environment.

Each difference requires special attention when the Arabic
alphabet is implemented in hardware. The direction of flow in
reading and writing is very complicated for users and
developers alike, especially where the keyboard, the display,
and the printer are to operate in bilingual mode. Arabic is
read and written in the opposite direction to Latin. The
difficulty arises when the user wants to switch to the other
mode for another application, or wishes to mix both character
sets within the same application.


A boom in the introduction of electronic computing to the
Arabic world led manufacturers to take shortcuts in meeting
the complicated needs of the Arabic alphabet. The Arabic
alphabet is also used in several countries with non-Arabic
languages. This wide use invited companies to develop a
character set for Arabic quickly, based on limited research.
As a result, important language needs such as diacritics were
ignored. This has also delayed the realization of an
effective solution.

The contextual problem, that is, the variant shape of
characters, is the most difficult. Establishing a solution
means deciding the style or method that developers should
follow in implementing Arabic character sets. The problem is
the complexity of providing the user with all possible shapes
of the 28-character set on the keyboard. Each character has
between two and four shapes, for a total requirement of 84
codes to represent the minimum set of the Arabic alphabet.
This is more than 50 percent higher than what the English
alphabet (upper and lower case) requires. The special
characters and diacritics require still more codes. In some
cases applying a diacritic to a character calls for a unique
shape, which in turn requires a unique code for the
combination of character and diacritic. The use of "Hamzah"¹
also requires special attention when it is combined with any
of the three vowels in the alphabet. The limited number of
codes the keyboard has is the limiting factor in planning the
code assignments. Some efforts and proposals are discussed in
the following chapter.

¹ The "hamzah" is one of the three characters that were
added to the alphabet in addition to the original character
set.
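The code-count comparison can be checked with a short calculation. The figure of three shapes per letter is an assumption made here to reproduce the total of 84 (the text gives only a range of shapes), and the variable names are illustrative.

```python
ARABIC_LETTERS = 28       # basic alphabet size given in the text
SHAPES_PER_LETTER = 3     # assumed average that reproduces the total of 84
ENGLISH_CODES = 26 * 2    # upper and lower case

arabic_codes = ARABIC_LETTERS * SHAPES_PER_LETTER
excess = (arabic_codes - ENGLISH_CODES) / ENGLISH_CODES

print(arabic_codes, ENGLISH_CODES, f"{excess:.0%}")
```

Under that assumption the Arabic letters alone need 84 codes against 52 for English, roughly 62 percent more, before any special characters or diacritics are counted.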

A.   DIRECTION  OF  FLOW 

Working in mixed mode is considered a must in the Arabic
environment. There are two approaches to handling the
mixed-mode data entry and storage problem. One approach calls
for the data to be stored in aural order (i.e., logical
order). The second approach is to store the data in the same
order as it looks (i.e., visual order). Keep in mind that if
an Arabic word is inserted in English text, the last character
of the word will be encountered first when scanning from left
to right.

The first approach places the burden on the display to
render the incoming data in the correct direction. The
display must interpret an escape code or a mode bit sent with
the data. The easiest method is to use the high bit (if it is
not otherwise used) to indicate whether the character is
Arabic or Latin. This option calls for smart display devices.
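A minimal sketch of the high-bit idea follows, assuming 7-bit character codes carried in 8-bit bytes. The names `ARABIC_FLAG`, `tag`, and `untag` are hypothetical and do not come from any of the systems discussed; real character sets may assign the eighth bit differently.

```python
ARABIC_FLAG = 0x80   # assumption: bit 7 is free and marks a character as Arabic


def tag(code, arabic):
    """Attach the language mode to a 7-bit character code."""
    if not 0 <= code < 0x80:
        raise ValueError("expected a 7-bit character code")
    return code | ARABIC_FLAG if arabic else code


def untag(byte):
    """Split a stored byte into (7-bit code, is_arabic)."""
    return byte & 0x7F, bool(byte & ARABIC_FLAG)


stored = tag(0x41, arabic=True)
print(hex(stored), untag(stored))
```

A smart display would call something like `untag` on every incoming byte and switch its writing direction whenever the flag changes.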

The second approach is to store the data in visual order.
This approach places the burden on the computer to determine
how to store the data so that the display direction never has
to shift. This means the display program keeps track of the
language mode and reverses the order as needed to store the
data in the appropriate order. In handwriting, mixed modes are
handled in the following fashion:

-  continue writing until reaching a foreign character.

-  count the number of spaces occupied by foreign
characters up to the first native character.

-  skip that number of spaces and write back to where you
stopped before skipping. When done, the writer should
end where he/she jumped from.

-  skip the same number of spaces you counted. This is
where the next native character belongs.

Humans seem to perform this routine more easily than
computers. The computer can only deal with incoming data as
it arrives, one character at a time, so it does not know in
advance how many foreign characters are coming. The computer
can use a logical device called a stack. Characters of the
other mode are stored (pushed on the stack) until the next
native character arrives. At this point the computer holds the
foreign string on the stack, ready to be read back in reverse
order. In the next step the computer writes from the top of
the stack until the stack is empty. Then the program continues
with the native character just encountered. In this approach
the direction of flow for the display is maintained. This
method has several disadvantages. One, it slows the storing of
data in mixed mode. Two, it keeps the computer from doing
other functions, whereas a smart display could handle the
display of mixed-mode data as they are stored logically.
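The stack procedure described above can be sketched as follows. To keep the example readable in plain ASCII, uppercase letters stand in for the foreign (Arabic) characters and everything else is treated as native. That convention, and the function name `to_visual_order`, are assumptions made only for this illustration.

```python
def to_visual_order(logical, is_foreign=str.isupper):
    """Reorder a logical-order stream so a fixed-direction display shows it correctly."""
    out = []
    stack = []
    for ch in logical:
        if is_foreign(ch):
            stack.append(ch)            # hold foreign characters until the run ends
        else:
            while stack:                # a native character ends the run:
                out.append(stack.pop()) # write the run back from the top of the stack
            out.append(ch)
    while stack:                        # flush a run that ends the text
        out.append(stack.pop())
    return "".join(out)


print(to_visual_order("say HELLO now"))
```

Each foreign run is reversed in place while the native text keeps its order. The sketch treats a space as native, so a multi-word foreign phrase is reversed word by word rather than as a whole. A fuller treatment would need a rule for such neutral characters.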

The  approach  that  should  be  taken  is  connected  with 
resolving  the  contextual  issue,  the  variant  character  shape 
problem. 

B.   ARE  DIACRITICS  REQUIRED? 

By linguistic standards, the omission of diacritics by
computers murders the Arabic language. Linguists have always
officially criticized the mispronunciation of statements by
television and radio presenters. The use of diacritics is a
must in the language, a view shared even by westerners
involved with the Arabic alphabet [Ref. 1: pp. 39-46].

Diacritics were introduced in a previous chapter. There
are five basic diacritics. The five are (Figure 3), from
right to left: "Fat_ha", "Dammah", "Kasrah", "Sukoon", and
"Shaddah". The first three can be doubled, in the same
manner as double quotes in Latin. When one of them is doubled
it is known as "Tanween" and adds an N sound to the character.
The Shaddah has the same effect as doubling the consonant in
English. It can be used in conjunction with any of the first
three or their "Tanween." The Sukoon, when used, means that
the character must be read in its primitive form, without any
of the preceding diacritics.

An example of one word using different diacritics will
show how the sound, and subsequently the meaning, changes.
The word pronounced "tilmeeth" in Arabic means a student.
The "th" at the end is the character "Thal" in Arabic. The
example shows the different sounds of the word when only the
last character carries different diacritics.

WORD        VOWELIZATION    PRONOUNCED

TILMEETH    "FAT_HA"        TILMEETHA
TILMEETH    "KASRAH"        TILMEETHI
TILMEETH    "DAMMAH"        TILMEETHO
TILMEETH    "SUKOON"        TILMEETH

Using the "Tanween" effect with the first three diacritics,
the same word is pronounced as follows:

with "Fat_ha tanween"       TILMEETHAN
with "Kasrah tanween"       TILMEETHIN
with "Dammah tanween"       TILMEETHON

Shaddah can be used with all of the above except the Sukoon.
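The two tables above amount to a small lookup, sketched here in Python. The transliterated endings are copied from the tables; the names `ENDINGS` and `pronounce` are hypothetical, and the Shaddah is left out because the text gives no transliteration for it.

```python
# Ending added to the transliterated word by a diacritic on its last character.
ENDINGS = {"FAT_HA": "A", "KASRAH": "I", "DAMMAH": "O", "SUKOON": ""}


def pronounce(word, diacritic, tanween=False):
    """Return the transliterated pronunciation for a diacritic on the last letter."""
    if diacritic not in ENDINGS:
        raise ValueError(f"unknown diacritic: {diacritic}")
    if tanween and diacritic == "SUKOON":
        raise ValueError("only the first three diacritics can be doubled")
    return word + ENDINGS[diacritic] + ("N" if tanween else "")


for mark in ENDINGS:
    print(mark, pronounce("TILMEETH", mark))
```

The same consonant skeleton therefore yields seven distinct pronunciations, which is why text stored without diacritics can be ambiguous.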

The use of diacritics removes ambiguity in the reading of
text. It is powerful enough to change the meaning of a
sentence completely. The vowelization of a verb by diacritics
can change the sentence to the passive voice. In Arabic the
verb comes before the noun, so the two statements "was stolen
Ali a book" and "stole Ali a book" cannot be distinguished
without diacritics, especially on the verb. The effect of
"er" and "ee" in English, as in "employer/employee", is also
achieved in Arabic by diacritics on the noun. In combination,
the failure to use diacritics can completely obscure the
meaning of a sentence. It would be as if, in the sequence
"fired/was fired employee/employer", we did not know which of
each pair of alternatives was meant: the employee either was
fired or fired someone, and likewise the employer either was
fired or fired someone. See Figure 6 for some examples with
and without vowels.

Clearly there is a need for diacritics. They are used
extensively in religious and history texts. At an
international symposium on the standardization of character
code sets and keyboards for the Arabic language in computers,
held on 1-4 June 1980, several proposals were presented by
researchers and by companies that had already developed their
own character sets [Refs. 1, 2]. All the proposals and
recommendations agreed on including the diacritics. The use
of diacritics will be beneficial in databases, artificial
intelligence, and educational textbooks.

C.   THE  CONTEXTUAL  ISSUES 

The  mere  presence  of  a  character  in  different  locations 
within  a  word  determines  the  shape  to  be  written  or  read. 
Should  the  computer  do  the  analysis  and  free  the  user  from 
worrying  about  a  large  complex  character  set,  or  should  the 
keyboard  contain  all  possible  variations  of  each  character 
and  have  the  user  learn  to  master  more  than  one  hundred 
strokes  for  the  alphabet  in  addition  to  numerals,  special 
characters,  and  punctuation? 
One popular approach is to provide only a minimum set of
required characters, usually between 31 and 60, not including
diacritics, numerals, and special characters. This approach
is known as the single-character, single-shape keyboard. The
data is stored in memory or on storage devices using this
reduced code. The reduced code is analyzed by an interface
that selects the right form or shape. The interface is either
part of the display, when smart displays are used, or a shell
on top of the operating system (O.S.) that contextually
analyzes the character form.
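The kind of analysis such an interface performs can be sketched as below. This is a simplified illustration, not the algorithm of any product named in this chapter. Letters are written by transliterated name, each letter is assumed to have at most four forms, and the set of letters that never connect to a following letter is an assumption drawn from general knowledge of the script rather than from the references.

```python
# Assumed set: letters that join to the preceding letter but never to the next one.
NON_FORWARD_JOINING = {"alef", "dal", "thal", "ra", "zain", "waw"}


def contextual_forms(word):
    """Choose a display form for each stored letter from its neighbours in the word."""
    forms = []
    for i, letter in enumerate(word):
        joins_prev = i > 0 and word[i - 1] not in NON_FORWARD_JOINING
        joins_next = i < len(word) - 1 and letter not in NON_FORWARD_JOINING
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((letter, form))
    return forms


# The word "kitab" (book) stored as four single-shape codes.
print(contextual_forms(["kaf", "ta", "alef", "ba"]))
```

The user types and the machine stores one code per letter, and the form is derived only at display time. Diacritics and the "Hamzah" combinations would need further rules.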

The issue is not yet settled and standardized among all
users of the Arabic alphabet, nor among the Arabic countries.
To my knowledge, authorized representatives from all the
concerned countries have not yet met and agreed on a
standard. A few companies that stepped into the market early
generated their own versions of character code sets. Some
companies have since recognized the gap between their early
implementations and actual language needs. The gap became
more apparent when the product was not used in all the areas
and aspects for which it was designed. Some companies have
realized that the survival and popularity of their product
depends on compatibility with at least the codes of a
character's internal representation. Some went further by
investing in research for an optimum solution. Language
experts were hired and/or consulted by companies like IBM, TI,
and WANG. These companies are following efforts toward
solutions and continuing the research to achieve an effective
one.

In resolving the multiple character shapes, most companies
have tried to reduce all possible codes to a single code per
character, using several philosophies. Texas Instruments has
presented [Ref. 1] three approaches to reducing the Arabic
code.

The first approach was called "CORRESPONDENCE &
DIFFERENCES." This approach divided the alphabet into groups.
Type A contains characters with one, two, or three points
(Appendix B). Type B contains characters without points.
Type C contains characters that share a basic form and differ
only in their points, for example the characters "RA" and
"ZA," which have the same form, one with a point and one
without. The idea is that if the basic form has one key
(code), two or more characters can share that basic form and
the points can be added later.

The second approach was called "ROOTS & APPENDICES"
(Appendix B). This approach divided the alphabet into groups.
Two groups have six characters each; another group has four.
The characters in each of these groups share the same cursive
ending, or "APPENDIX." The "ROOT" of a character can be used
at the beginning or in the middle of a word. One appendix
complements every root in a group, so a group of six roots
requires a total of seven codes. Without this scheme, the
group would require (for six characters, each with three
contextual forms) a total of 18 separate keys and/or codes.
This approach implicitly asks for more software to analyze
the appendices. A character may also be represented
internally by two codes, which makes text storage
inefficient.

The last approach was "CONTEXTUAL ANALYSIS" (Appendix B).
Texas Instruments has developed a product using this
approach. The DS990 Bilingual System can handle Arabic/Latin
modes and display them on a screen or line printer. The
contextual analysis approach, in all the developments seen by
the author, uses a reduced code set for the internal
representation of data. Keyboard keys of the Arabic set are
kept to a minimum, usually one per basic form. A software
interface analyzes each character contextually and displays it
in the right form. In some applications this interface software
is moved off the CPU and into the display terminals. Such
terminals are called 'SMART' terminals. TI's DS990 system
diagram (Appendix C) shows the configuration of a typical
system.
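A rough sketch of what such a contextual analysis has to decide is given below in Python. It is not the DS990 algorithm: it uses Unicode letters as a stand-in for the reduced code set, and the set of letters that never connect to the following letter is a simplified assumption (alef, dal, thal, ra, za, waw).

```python
# Letters that join only to the preceding letter (simplified assumption):
# alef, dal, thal, ra, za, waw.
NON_FORWARD_JOINING = set("\u0627\u062F\u0630\u0631\u0632\u0648")


def contextual_forms(word):
    """Return the display form needed for each character of a word."""
    forms = []
    for i, ch in enumerate(word):
        # A character connects backward if the previous one connects forward.
        joins_prev = i > 0 and word[i - 1] not in NON_FORWARD_JOINING
        joins_next = i < len(word) - 1 and ch not in NON_FORWARD_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms
```

The keyboard and the memory hold one code per character; only the display step needs the form names returned here, which is why the work can be delegated to a smart terminal.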

TI realized the need for diacritics in the Arabic language
after it introduced the system to the marketplace. At an
international symposium held in Riyadh, Saudi Arabia, on 1-4
June 1980 [Ref. 1:p. 68], in an effort at standardization of
codes, character sets, and keyboards, TI recommended that the
Arabic computer systems standards requirement include the use
of diacritics. This is an example of the approach of the
pioneer companies, who had to define and develop the alphabet
code sets themselves. Such premature standards will
automatically be overridden by the authorized agency. The
DS990 did not handle diacritics. Since the use of diacritics
was adopted by all standards committees, a few companies were
led to follow a new standard that supports diacritics.

ALIS, Inc., introduced BCON™ as a bilingual operating
system geared toward MS-DOS based microcomputers. The
bilingual operating system is an interface between the
operating system (O.S.) and different applications [Ref. 2].
It adopted the single key, single code approach: each
character is represented internally in memory by a unique
code. BCON also fully supports the use of diacritics in text.
The single code approach, as mentioned before, requires that a
device or an interface (hardware or software) properly analyze
each character and display the correct form. BCON uses
Application Screen Image Compensations (ASIC) to perform the
contextual analysis. BCON uses separate codes and fonts for
each character form: the internal character code is translated
(mapped) to an output code, and each internal code has four to
five possible output codes. The code to be displayed is chosen
according to the location of the character within the word
(TI's and BCON's systems will be covered in more detail in the
next chapter).

IV.   EFFORTS TO STANDARDIZE CODES

Several nations use the Arabic alphabet today, both Arabic
speaking and non-Arabic speaking. It is a political challenge
to gather the concerned nations and succeed in establishing a
standardized code set acceptable to all of them. It is
difficult for any one country to take the initiative and the
responsibility to follow such a program until it comes to
life. It is also hard for a single country to conduct research
and share knowledge with another country that is thousands of
miles away. In recent years, as cooperation between Arab
nations has increased and as methods of communication and
travel have improved, there have been more productive meetings
and symposiums. Several countries have cooperated to develop
a possible solution to the standard code set for Arabic in
data processing.

Many countries, such as Kuwait, Iraq, Morocco, and Saudi
Arabia, have hosted meetings and symposiums, listening to
experts on the language and in the data processing field.
Researchers, as well as company representatives, have brought
up points to consider, shared their experiences, and given
recommendations. Several existing systems have been developed
or proposed by companies or individuals in the field. The
countries that have been exposed to technology, and are more
developed than other Arab nations, have an urgent need to set
standards in general. Countries like Morocco started as early
as the 1950's to set standards for printing devices.

The North African countries have progressed further in this
research. Morocco willingly shared its latest research and
developments in the area with the other Arab nations. The
choice between adopting an existing system, with some or no
modification, and defining a new standard once again is also a
political issue.

A.   SOLUTION  EFFORTS 

Several companies have provided the results of their research
and in some cases have implemented systems, giving
recommendations and, in the case of keyboard layout proposals,
the results of conducted tests. Companies that have an
interest in the market and have worked in the Arabic data
processing field have no authority to develop a code set
standard. Government representatives are the authorized
agency for such a task. Given the lack of standards, several
companies have proceeded to develop Arabic code sets and
implement them on hardware. This has resulted in several
incompatible code sets: data in one system means different
things in another. This approach to the development of code
sets has both disadvantages and some advantages for the
companies involved.

Early development made companies as well as users understand
the weaknesses of the developed systems. For example, the
omission of diacritics in TI's DS990 system failed to fulfill
the needs of the language. On the other hand, just by
introducing a product early, companies make their names
familiar to customers. The customer cannot complain about a
reasonable attempt. This did establish a good reputation for
such companies, especially when they adopt the approved
standard and reintroduce their products. In addition to
developing a good name, they gain experience in the process,
which will help them introduce a product complying with the
standards sooner. So a company's early efforts are not a total
waste.

Since early implementations ignored the use of diacritics
with text, newer designs have to pay special attention to
them. Data base machines must take care when sorting and
searching. The representation of diacritics will require
special care from data processing machines, and the priority
of characters with or without diacritics must be known to the
machine. Stripping the diacritics from a stored string before
matching it against a query will facilitate the search.
However, the target of the search, when found, must be
displayed, and stored if updated, in the vocalized form.

Unlike Texas Instruments, IBM chose to maintain its domination
of the market for typewriters and Arabic-only EDP machines.
IBM did conduct studies of its own in an effort to develop a
code set and keyboard layout. IBM, represented by Mr. R.P.
Hajjar and Dr. A.M. Ismail, presented its attitude toward a
bilingual code set standard at the symposium held in Riyadh,
Saudi Arabia, in June 1980 [Ref. 1:p. 72]:

Meanwhile, competent people from the Arab world and from
elsewhere, have addressed the same subject and came up
with a variety of solutions that are not compatible with
each other, due to the fact that they reflect the
requirements of a particular Arab country, but may not be
totally acceptable by the neighboring Arab country. This is
the main reason why IBM has not implemented such solutions,
but will look forward to investigate the possibilities of
their implementation, in case these solutions are adopted
as part of an inter-Arab standard.

IBM, TI, and Wang have shared their research and their
willingness to achieve a solution and adopt it in their
products.

This  chapter  will  briefly  cover  three  systems: 

-  TI  DS990  System 

-  ALIS  Inc.,  BCON  System 

-  ASV-CODAR  Proposed  System. 

B.   TI  DS990  BILINGUAL  SYSTEM 

DS990 is a bilingual system that generates seven-bit codes
for ASCII and eight-bit codes for Arabic. The system
represents the Arabic alphabet with 32 unique codes in
addition to 13 special characters. The thirty-two codes are
the internal representation of the alphabet and form the basic
character set of the system (Appendix C). TI's system uses the
one key, many shapes philosophy. This approach requires an
interface with a smart display to show the correct form and
shape. The DS990 block diagram (Appendix C) shows how the
system is arranged. The 32 codes are mapped to 128 less 13,
giving a total of 115 shapes that can be displayed. The
display ROM interface contains all 128 shapes (Appendix C).
The display service routine (DSR) and the display ROM
interface contextually analyze the basic code set and display
the data correctly by mapping one code to one or two display
codes.

DS990 does not handle diacritics. It also increases the
optimum set from 31 to 32 unique characters, because the
system considers LAMALEF a single character. These are two
clear violations. The use of diacritics is a must in data
processing. The LAMALEF (DC hex in the basic character set,
Appendix C) is composed of the character LAM (D6 hex) followed
by the character ALEF (C0 hex); these are two separate
characters and should not share a unique code. The fact that
the table shows no special codes for eastern Hindu numerals
indicates that the same codes are used for them and for the
Arabic numerals, known as western Hindu (Figure 5). The
eastern (Hindu) and the Arabic (western Hindu) numerals are
displayed differently depending on the display mode. So a user
in a North African country cannot use the western Hindu
numerals (known as Arabic numerals) in Arabic mode. This is
not desirable.

DS990 stores information in memory in logical order in both
Latin mode and Arabic mode. The display ROM interface and the
control program map the internal representation of one code to
one or two display codes. For example, the character 'SEEN'
(CB hex in the basic character set, Appendix H) is represented
by two display codes in the display ROM interface table: BC
hex followed by 8B hex.
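The two code manipulations described for the DS990 can be sketched in Python as follows. Only the numeric values quoted in the text are used (reading the ALEF value as C0 hex); the table holds the single SEEN entry given above, and the pass-through for codes missing from the table is an assumption made for the sake of a runnable example. A real display table would also depend on the position of the character in the word.

```python
LAM, ALEF, LAMALEF, SEEN = 0xD6, 0xC0, 0xDC, 0xCB

# Only the SEEN entry comes from the text; everything else is omitted.
DISPLAY_CODES = {SEEN: (0xBC, 0x8B)}


def expand_lamalef(codes):
    """Replace the single LAMALEF code by the two characters it stands for."""
    out = []
    for code in codes:
        if code == LAMALEF:
            out.extend((LAM, ALEF))
        else:
            out.append(code)
    return out


def to_display_codes(codes):
    """Map each internal code to one or two display codes."""
    out = []
    for code in codes:
        # Assumption: codes without a table entry are passed through as-is.
        out.extend(DISPLAY_CODES.get(code, (code,)))
    return out
```

The first function shows why a unique LAMALEF code is redundant: the same information is carried by two existing codes. The second shows the one-to-two mapping performed between memory and screen.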

The approach followed by TI is the typical way most companies
are implementing their display techniques. However, its
disadvantages are the omission of diacritics and the treatment
of "LAMALEF" as one character. TI has indicated that it now
believes the implementation must have diacritics [Ref. 1].

C.   ALIS  INC.,  BCON  SYSTEM 

ALIS Inc. introduced BCON™ as a bilingual operating system
that could be a standard to follow, or at least close to a
standard. The bilingual operating system adopted the single
key, single code approach. Each character is represented by a
unique code internally in memory. BCON also fully supports the
use of diacritics in text. BCON was geared toward MS-DOS based
microcomputers. The bilingual operating system is an interface
between the MS-DOS operating system and applications. BCON is
designed to facilitate the adaptation of the large number of
existing MS-DOS applications to Arabic [Ref. 2]. The single
code approach, as mentioned before, requires that some device
or interface (hardware or software) properly analyze the
character and display the correct form. BCON uses Application
Screen Image Compensations (ASIC) to perform the contextual
analysis, and then selects the correct display code
(Appendix D).

1.   Hardware  and  Software  of  BCON 

BCON hardware is another board on top of the Latin character
generator board. The new board carries the Arabic character
generator with the wiring required to allow concurrent
operation of both character generators. The two boards are
mounted back to back and use one slot on the motherboard of
the microcomputer. Keyboard caps (or stickers) are provided
for use on the keyboard. The stickers have both alphabets
printed side by side.

The software is a program which, when activated, resides in
low memory and uses 19k bytes. Once BCON is activated, it can
be set in Latin "native" mode or Arabic mode. The only way to
free the memory is to reset the system. Both modes of the
operating system allow bilingual insertion in the appropriate
direction. In their early version (up to early 1985), ALIS
introduced a reduced code called Arabic Reduced Code for
Information Interchange (ARCH). ARCH is the internal
representation of the characters in memory and is what the
operating system sees.

2.   ARCH Code Set

Arabic Reduced Code for Information Interchange (ARCH) is
ALIS's early attempt to define a code set. The reduced code
(Appendix D) is the internal representation of data in memory.
It is completely different from the early proposals for a
target standard set put forward by ASMO (further details will
be covered in the next section).

The code uses the graphic characters for the Arabic set. By
setting the eighth bit to one, 128 additional codes become
available for Arabic. This allows the BCON bilingual system to
mix codes and use both ASCII and ARCH. There are 46 different
codes assigned to the alphabet, starting with code D0 hex and
ending with FD hex. ARCH places the diacritics early in the
table to give them priority in sorting algorithms. This early
positioning in the table was not favorable, however. The
reasoning will be discussed when the standard code and the
justification of its format are discussed. The escape codes
and special characters should not be redefined for ARCH if
similar ones exist in Latin. This minimizes the code set for
ARCH, freeing more codes for future expansion. The number of
function codes could likewise be minimized by using the
international ones.
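The figures in this paragraph can be verified with a few lines of Python. The sketch below assumes nothing about the individual ARCH assignments, which are listed in Appendix D; it only uses the eighth-bit rule and the alphabet range D0 hex to FD hex, and its function names are hypothetical.

```python
ALPHABET_FIRST = 0xD0  # first alphabet code quoted in the text
ALPHABET_LAST = 0xFD   # last alphabet code quoted in the text


def in_upper_half(code):
    """True for the codes made available by setting the eighth bit."""
    return 0 <= code <= 0xFF and bool(code & 0x80)


def is_alphabet_code(code):
    """True for codes inside the range assigned to the alphabet."""
    return ALPHABET_FIRST <= code <= ALPHABET_LAST


def split_mixed(codes):
    """Separate ASCII codes from upper-half codes in mixed text."""
    latin = [c for c in codes if not in_upper_half(c)]
    upper = [c for c in codes if in_upper_half(c)]
    return latin, upper
```

Counting over all 256 byte values gives 128 upper-half codes, and the range D0 to FD hex contains exactly 46 of them, in agreement with the text.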

ALIS reduced code is completely different from the early
proposals for a target standard set. The Arab Organization for
Standardization and Metrology (ASMO), after several years of
research and after meeting with Arab representatives,
recommended the use of CODAR U-F.D. as a standard for Arabic
codes (further details will be covered in the next section).
Subsequently, ALIS and other companies adopted the new code
set in order to assure compatibility with other applications
and implementations. BCON's original version of the reduced
code (ARCH) (Appendix D) is the internal representation of
information in memory.

The form or appearance of characters, that is, how they should
be displayed, is not a major issue. It depends on the
resolution and capabilities of the machine. The fonts and
style of displayed text vary from one machine to another. ASMO
has recommended that the style of displayed text be left to
developers. This has left a lot of room for manufacturers to
be creative and to compete on quality for the benefit of the
user.

3.   Operating Principles of BCON

BCON, once loaded, resides in memory using 19k of low memory.
BCON has three code sets: reduced code (ARCH), key code, and
display code. Figure 7 shows how the three codes are
integrated with each other. A list of the three code sets is
provided in Appendix D. ARCH includes the diacritics as a part
of the code set, as required by the CODAR U-F.D. standards.
BCON receives the key code and stores it in memory in reduced
code form. BCON then analyzes the reduced code contextually
and displays each character in the correct form. In the
display process, BCON appends, where necessary, what is called
"TAIL GENERATION" to some characters if they fall at the end
of a word [Ref. 2].

The  early  work  on  BCON,  as  well  as  the  work  of  other 
companies,  must  be  modified  to  correspond  to  the  new 
standards.  ALIS  in  early  1986  introduced  a  new  mode  in 
addition  to  ARCH.  The  new  mode  uses  the  ASMO  approved  code 
set.  No  documents  are  available  at  this  time.  However,  as 
mentioned  before,  previous  effort  was  not  totally  lost.  The 
company  still  utilizes  the  contextual  analysis  developed 
earlier,  with  minor  modifications.  The  same  is  true  for 
their  printer  driver  software.  This  is  a  good  example  of 
how  early  development  enables  a  company  to  react  quickly  to 
new  demands. 

D.   ASV  CODAR-U  SYSTEM 

In researching the early efforts initiated by official
organizations or government agencies for inter-Arab
unification of the code set, two names were always associated:
CODAR and Dr. Lakhdar. A few acronyms are important here:

CODAR :   Code Arabs (French)

ASV   :   Arabe Standard Voyelle (French)

IERA  :   Institut d'Etudes et de Recherches pour
          l'Arabisation

IBI   :   Intergovernmental Bureau for Informatics

COARIN:   IBI Committee on the use of Arabic in Informatics

ALESCO:   Arab League Education Cultural and Science
          Organization

SASO  :   Saudi Arabian Standards Organization

ASMO  :   Arab Organization for Standardization and Metrology

Dr. Ahmed Lakhdar Gazal, Director of IERA (Institute for
Research and Studies for Arabization in Rabat, Morocco), has
been associated with the CODAR project for several years.
Dr. Lakhdar proposed that the Arab nations adopt the CODAR
system as a standard for telecommunications. IERA has been
working on the problem since as far back as 1955. A
standardized Arabic code was a dream many people had expected
and needed for many years. However, IERA has no power to
define such a code or make it official, even assuming it is
acceptable.

The CODAR system is a long-running project geared toward
setting standards for several fields of interest. The project
covers:

PRINTING

-  TYPEFACES

-  TRANSFER LETTERS, SELF-ADHESIVE TYPES

-  SLUG-CASTING MACHINES

-  MOVABLE TYPE COMPOSITIONS-CASTER

-  PHOTOCOMPOSITION

TYPEWRITERS

INFORMATICS AND DATA TRANSMISSION

TELECOMMUNICATIONS

This chapter is concerned with Informatics and Data
Transmission. However, a lot of credit must be given to the
personnel behind CODAR. It took a great deal of effort and
dedication by IERA's staff to accomplish a unification. CODAR
has collected a long list of acknowledgments, appreciation,
and financial support letters from several countries and
organizations. The participants include:

-  Moroccan Ministry of Education (1956)

-  First Conference of the Arab National Commissions for
   UNESCO (1958)

-  First Conference on Arabization (Rabat, 1961)

-  UNESCO (Arab book-keeping experts meeting) (Cairo, 1972)

The full list of occasions and dates is given in [Ref. 1:pp.
207-210].

Under Informatics and Data Transmission there were three
versions of the 7-bit code system. They are:

Seven bit CODAR I :   first coding scheme of the ASV
                      characters

Seven bit CODAR II:   a proposition for a unified Arabic
                      coding scheme, discussed at a regional
                      (IBI) meeting at Bizzert, Tunisia,
                      June 1976

Seven bit CODAR U :   unified coding scheme for the Arab
                      countries proposed by COARIN (IBI
                      committee on the use of Arabic in
                      informatics) at a meeting in Rome,
                      June 1977.

The seven bit CODAR I, CODAR II, and CODAR U (Appendix E) are
code set proposals. CODAR I was produced by EURAB and the
printers were manufactured by the Italian firm SELL. CODAR II
is a subsystem of CODAR I. The subsystem can be obtained by
removing all possible combinations of the "Harakat" (i.e.,
Fat'ha, Kassrah, and Dammah) with the "Shaddah." The subsystem
also leaves out three Persian characters, the opening and
closing square brackets, the backslash, and a few character
variant shapes.

CODAR  U  fully  supports  vocalization  with  all  possible 
"Shaddah"  combinations  with  the  "Harakat."  This  system  is 
the  closest  to  being  acceptable  by  ASMO  and  approved  as  a 
standard.  ASMO's  approval  will  give  the  system  official 
status. 

E.   THE  STANDARDIZED  SET 

In 1980 CODAR U was accepted as a working basis for a basic
code set. Recommendations and modifications were to be
presented to ASMO in order to formalize the code set. The next
step was to distribute it to ASMO's members. Member countries
ensure that it is implemented accurately.

During a meeting held on 22-24 April 1982 in Rabat (Morocco),
the proposed standard, called CODAR U-F.D., was finalized and
submitted to ASMO along with six recommendations (Appendix F).
The conference recommended that ASMO have the code distributed
and tested by IERA, SASO, and the National Center for
Information in Tunisia before enforcing it. ALESCO and ASMO
were also urged to make every effort toward the adoption of
the code by all Arab countries.

Finally, on October 21, 1982, ASMO adopted the code prepared
by IERA and ALESCO. This code was the result of the CODAR
U-F.D. proposed in April 1982 at Rabat. The modifications and
changes are included (Appendix G). There are a few points to
consider. There are 31 codes for the alphabet, 3 codes for
"Harakat," 2 codes for "Shaddah" and "Sukoon," 5 codes for
"Hamzah," and 3 codes for "Tanween," totalling 44 codes. Their
location in the table must not be changed under any
circumstances. The "Hamzah" in all its variations, on top of
or under characters, is considered a form of "Hamzah." The
"Hamzah" is placed at the beginning of the code table, which
in searching means that any character with a "Hamzah"
associated with it should be expected higher in order
(equivalent to "A" in Latin). This concept will confuse users
when searching or sorting, and the results of sorting
algorithms may be surprising. In principle the table allows a
simple sort, but errors will result from the occurrence of
diacritics and of the code 60 hex (6/0) in the text. The code
60 hex is used for connecting or extending a word for
formatting purposes. So a sorting algorithm should first strip
the text of the diacritics and the connection dash (similar to
the Latin underscore), then sort the text according to the
basic 31 character codes. The user must be educated about all
the remarks mentioned in the reasoning in ASMO's final form of
the code set. Another convention is that the character comes
first in words that are vocalized. The form to follow is:

WORD ::= { <CHARACTER> <SHADDAH> <DIACRITICS> }*

So the "Shaddah" comes before the diacritics if both are used
for a character. The second convention is that if the pure
words match in sorting, the diacritics should then be used by
the sorting algorithm as qualifiers. In my opinion, this
violates the Regularity Principle in programming, since the
user must be aware of and remember all the exceptions. This
does not in any way mean there is an easier way.
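The two conventions can be illustrated with a short Python sketch. It is not based on the ASMO table itself: Unicode code points stand in for the ASMO codes, the tatweel character (U+0640) plays the role of the connection code 60 hex, the "Shaddah" and the diacritic are both assumed to be optional after a character, and plain code point order replaces the order of the ASMO table.

```python
import re

SHADDAH = "\u0651"
# Tanween, harakat and sukoon (Unicode stand-ins for the ASMO codes).
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0652"
CONNECTOR = "\u0640"  # tatweel, standing in for the connection code
LETTERS = "\u0621-\u063A\u0641-\u064A"

_STRIP_RE = re.compile("[" + SHADDAH + DIACRITICS + CONNECTOR + "]")
# WORD ::= { <CHARACTER> [<SHADDAH>] [<DIACRITIC>] }*
_WORD_RE = re.compile(
    "(?:[" + LETTERS + "]" + SHADDAH + "?[" + DIACRITICS + "]?)*"
)


def strip_marks(word):
    """Remove diacritics and connectors, leaving the pure word."""
    return _STRIP_RE.sub("", word)


def is_well_formed(word):
    """Check that every shaddah precedes the diacritic of its character."""
    return _WORD_RE.fullmatch(word.replace(CONNECTOR, "")) is not None


def sort_key(word):
    """Pure word first; the vocalized form only breaks ties."""
    return (strip_marks(word), word)


def sort_words(words):
    return sorted(words, key=sort_key)
```

The two-part key mirrors the text: words are ordered by their basic characters, and the diacritics act as qualifiers only when two pure words are identical.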

F.   CONCLUSION 

The ASMO code set is the standard Arabic code set that the
Arab countries must enforce. Subsequently, all companies in
the area must adopt and use this standard code set. The
competition is now directed toward improving display
applications with high resolution and graphic capabilities.
Printing devices are also an area in which manufacturers can
compete, by printing different Arabic styles and fonts. The
contextual issue is left to the implementors to research and
develop for their individual products. The display form of
text on monitors and printing devices will not affect the
internal representation of the data, which must be compatible
with the standard code set. This may result in several display
sets, each reflecting a company's view of how good Arabic text
should be displayed. Hopefully this will create a stable base
to work with and encourage the development of products based
on the ASMO standards and conventions listed in Appendix G.

V.   INTERFACE DESIGN GENERAL APPROACH

The lexical translator will generate Latin code from an Arabic
source code in Pascal syntax. The Pascal compiler can then
compile and run the Latin code to generate an output. The
interface will generate correct Latin code provided that the
Arabic source code is syntactically correct. The translator
will give minimum help in correcting the Arabic code. The user
must understand the syntax and the semantics of the language
to write correct source code. The interface is not an
interactive type of translator. The design is generally the
same for all Pascal compilers. The interface must always
consider the environment it will work in. It has two
environments to consider: the bilingual system of the source
code, and the compiler environment. From the portability and
compatibility point of view, the translator will be limited to
a particular Arabic standard and a particular PASCAL
implementation.

The bilingual implementation has its own function codes. Those
codes are embedded within the Arabic source code if it is
generated under the bilingual operating system. The bilingual
operating system used here is BCON from ALIS, Inc. There is a
list of function codes in Appendix D. The PASCAL compiler used
here is TURBO PASCAL from Borland, Inc.

The Arabic implementation utilizes the upper half of the
256-character set, normally used for graphics, to display
Arabic fonts. Some Pascal compilers will accept any of the 256
characters as legal characters for use in string data. Turbo
Pascal, for example, allows the entire set. This is one reason
why Turbo Pascal is used in this thesis as the target
environment for the generated code. The interface will,
however, generate correct PASCAL code even if the source code
follows standard Pascal.

The  compiler  will  always  refer  to  the  Turbo  Pascal 
compiler  even  though,  from  a  theoretical  point  of  view,  it 
should  be  any  Pascal  compiler.  Similarly,  since  there  is  no 
standard  representation  of  Arabic  data,  i.e.,  available  and 
implemented,  we  use  the  BCON  operating  system,  using  ARCH, 
as  the  internal  representation  of  data  in  memory. 

A.   MAJOR  CONCEPTS 

The  interface  looks  at  any  piece  of  code  (token)  as  one 
of  several  types.   These  types  are: 

-  Literal  string 

-  Comment 

-  Integer 

-  Identifier 

-  Functional  operator. 

Literal strings are constants and the interface does not alter
their ASCII values. Comments are surrounded by '(*' and '*)'
in their Arabic equivalent codes. Integers are important and
easy to handle since there is an isomorphic relationship
between Arabic integer tokens and Latin ones. A real number
token is made up of two integer tokens separated by a
functional operator. An identifier is any legal name in
Pascal, either a reserved word or user-defined. Functional
operators are all the codes that are used for addition,
brackets, pointer arrows, etc. In setting the specification
for programming in Arabic Pascal, the optimum goal is to have
a one-to-one relationship between the Latin and the Arabic
special characters. We also want to avoid overloading the use
of special characters.
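The five token types can be illustrated with a small Python tokenizer. This is a sketch under stated assumptions and not the translator described in this thesis, which is written in Turbo Pascal: it works on Latin stand-ins for the Arabic codes, and the regular expressions and the list of two-character operators are hypothetical choices.

```python
import re

# Order matters: comments and strings are tried before operators so
# that '(*' and the quote characters are not split off on their own.
TOKEN_SPEC = [
    ("COMMENT", r"\(\*.*?\*\)"),               # one line only
    ("STRING", r"'[^'\n]*'|\"[^\"\n]*\""),     # single or double quotes
    ("INTEGER", r"\d+"),
    ("IDENTIFIER", r"[^\W\d]\w*"),
    ("OPERATOR", r":=|<=|>=|<>|\.\.|[^\s\w]"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(line):
    """Split one source line into (type, text) pairs, skipping blanks."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(line)]
```

With these rules a real number such as 3.14 comes out as an integer token, an operator token, and another integer token, which matches the description above.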

1.  Literal  Strings 

Literal strings are used for assigning values to string
variables and in read and write commands. Strings are used to
interact with the user of an application and to help the user
understand the performance of the program. Therefore we do not
alter these strings. A literal string is any string of
characters surrounded by single or double quotes. It is the
programmer's responsibility to verify the content of an
assigned string. The literal string can contain any character
of the entire set 80 hex ... FF hex.

2.  Comments

The comment length is limited to one line. The comment is
enclosed by an opening bracket followed by an asterisk, and
ends with an asterisk followed by a closing bracket. When the
translator encounters the beginning of a comment it looks for
the end of the comment. The comment is considered as one
token. The translator will not alter the content of the
comment since it is for the use of the programmer only.

3.  Integers

Integers are any consecutive digits from 0-9 with no
separation in between. For example, the integer printing
format "2245:6" is considered as three tokens as far as the
translator is concerned. The first token is the integer
"2245", the second is the functional operator ":", and the
third is the integer "6".

Real numbers are made up of three parts, as one would
expect: an integer token, the Arabic numeric comma, and
another integer token.
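
The splitting rule above can be sketched as follows. This is
only an illustration in Python (the thesis program itself is
written in Turbo PASCAL); the name split_numeric is
hypothetical, and Latin digits stand in for the Arabic digit
codes, whose values are not listed in this section.

```python
DIGITS = "0123456789"  # assumption: Latin stand-ins for the Arabic digit codes


def split_numeric(text):
    """Split text into integer tokens (runs of digits) and
    single-character functional operator tokens."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in DIGITS:
            start = i
            # Consume consecutive digits with no separation in between.
            while i < len(text) and text[i] in DIGITS:
                i += 1
            tokens.append(("integer", text[start:i]))
        else:
            # Any non-digit is passed on as a one-character token.
            tokens.append(("operator", text[i]))
            i += 1
    return tokens


print(split_numeric("2245:6"))
```

Under the same rule a real number written with a comma, such
as "3,14", also comes out as three tokens: integer, operator,
integer.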

4.  Identifiers

All legal Pascal names fall under this category. This
includes reserved words and variable names. The token is
identified first as an identifier, then looked up in the
reserved words group. If it is not in the list then it is a
variable name. Variable names include variables, labels,
and procedure and function names. When an identifier is
encountered and it is not a reserved word, it is given an
identifier number. The identifier number is stored with
other information about the token in a symbol table
organized by a hashing scheme. The token is first looked up
in the symbol table. If it is not there, it is entered at
the beginning of the linked list for its hash key. Since the
primary user of the translated code is the compiler, the
program will have meaningless variable names. However, the
translator will generate a file called "DICTIONARY"
containing each identifier number and the Arabic token
associated with it.

5.   Functional  Operator 

Tokens are identified by separators and terminators.
Blanks are separators, as are other codes that also have a
function of their own. For example, the plus and minus
signs as well as the up_arrow symbol in PASCAL are
separators. If, for example, the variable root^.left_sun
were the Arabic token, it would be translated into something
like id_1^.id_2, where the identifier numbers are entered
for the Arabic tokens.
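
A minimal sketch of this behavior, in Python rather than the
Turbo PASCAL of the thesis program. The function name
translate_identifiers and the regular expression are
assumptions made for illustration, and Latin letters stand
in for Arabic ones. Identifiers are numbered in order of
first appearance while separators such as '^' and '.' are
copied unchanged.

```python
import re

# Assumption: an identifier is a letter or underscore followed by
# letters, digits, or underscores (Latin stand-ins for Arabic letters).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def translate_identifiers(text, table):
    """Replace each identifier in text with id_<n>. `table` maps
    source names to numbers and is updated on first appearance."""
    def label(match):
        name = match.group(0)
        if name not in table:
            table[name] = len(table) + 1
        return "id_%d" % table[name]
    return IDENTIFIER.sub(label, text)


table = {}
print(translate_identifiers("root^.left_sun", table))  # id_1^.id_2
```

The table left behind after a run is what a dictionary file
would list: each identifier number next to its source token.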

The scope of the variables will distinguish frequently
occurring variable names. If id_1 occurred in two
declarations, the compiler will distinguish between the two
occurrences of id_1, depending on the location of the
declarations. Therefore the translator does not need to
concern itself with multiple uses of the same name.

B.   OPERATING  PRINCIPLES 

The translator goes through several phases and each
phase has a sub-task. The process begins with the name of
the Arabic source code file. The file is opened, the target
output file is initialized and a dictionary table file is
opened. The second phase fills a buffer with a code segment
of the source code, a line at a time. The line is broken
into tokens. Each token is given a type and then
translated. The cycle is repeated for each line up to the
end of the source file.

1.  File  Opening  and  Initializing  Phase 

The program starts by prompting the user for the source
file name. The file is checked for existence and then reset
for reading. The file name is used to open two more files,
the dictionary file and the output file. The initialization
is concerned with the hash table that has information
regarding the record structure of the identifier's symbol
table. Other parameters control optional features, such as
listing the source comments with the output code. Another
feature is debugging, for tracing the translation while the
translator is scanning and translating the source code.
Both the comment and debugging features should be easily
set at any point in the source code. The remaining
parameters, for example line number and identifier number,
are initialized.

2.  Reading and Decomposing the Source Code

An  input  buffer  is  filled  from  the  source  code  and 
scanned.  A  line  at  a  time  is  read  from  the  buffer  and 
checked  for  special  instructions  (directives)  for  the 
translator.  If  the  line  is  not  a  directive,  it  is  checked 
to see if it is a comment. If the line is a comment or
starts with one, then the comment is either omitted or
written out depending on the comment option. The comment
option is a Boolean variable set by the user within the
program source code, to either omit or write out the comment
tokens in the generated file. The line, or the remainder of
the line, is then decomposed into tokens. Tokens are
identifiers, integers, blanks, or special characters.
Identifiers are either reserved words or user-defined
identifiers. Reserved words are matched with their
associated Latin reserved words. User-defined identifiers
that do not already exist are given a label number in the
sequence of their first appearance. Integer tokens are
scanned and each digit is mapped into its matching Latin
digit. Special characters are given their equivalent Latin
characters, such as the Arabic and Latin semicolon. Blanks
are copied, as this makes for better formatting of the
generated code.

The  investigation  of  the  token  type  is  based  on  the 
first  character  of  the  next  token  in  the  input  buffer.  For 
example,  if  the  first  character  is  a: 

-  Letter:    Then investigate the possibility that it is an
              identifier.

-  Digit:     The token must be an integer.

-  Other:     Then it must be a special character.
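
The three cases above can be sketched as a small Python
function. This is an illustration only: the name
likely_token_type is hypothetical, and Unicode character
classes are assumed in place of the BCON code ranges used by
the actual Turbo PASCAL program.

```python
def likely_token_type(first_char):
    """Guess the token type from the first character of the next token."""
    if first_char.isalpha():
        # A letter may start a reserved word or a user-defined identifier.
        return "identifier"
    if first_char.isdecimal():
        return "integer"
    return "special character"


# U+0628 is an Arabic letter, U+0663 an Arabic-Indic digit,
# U+061B the Arabic semicolon.
for ch in ("\u0628", "\u0663", "\u061b"):
    print(repr(ch), likely_token_type(ch))
```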

In this phase only the identifiers are translated. When a
user-defined identifier is encountered, and if it has not
previously been recognized, it is given the next identifier
number in sequence. Reserved word tokens are stored in a
constant table, in a record format. Each record has an
Arabic word and the matching Latin one. Any identifier
token is first looked up in the table. If found, the index
of the matched record is passed back to the main program.
The integer tokens are given the type integer and passed
back to the main program. If none of the above applies,
then we get one character and pass it back individually.

In short, each token is given a token type and length,
and passed back to the main program. Reserved words are
additionally passed back with the match index. Identifiers
are looked up in the symbol table and inserted if not
found; their identifier number (in Latin characters) is
passed back.

3.  Token Translation Phase

The tokens are translated into Latin based on the
token type. The integer tokens are translated by mapping
each Arabic (Eastern Hindu) digit into its associated Latin
(Western Hindu) digit. Reserved word tokens are translated
by writing their matched Latin reserved word, using the
match index found earlier. User-defined identifiers are
replaced by the identifier numbers assigned to them. The
rest of the special characters are either looked up in a
"CASE OF" (a PASCAL control statement) list or stored in a
constant table (array). This model uses a case statement.
As each user identifier is translated and written out in
the output file, it is also written out in the dictionary
table along with the Arabic token associated with it.
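
The digit and special character mappings can be sketched in
Python as below. The thesis program works on BCON codes in
the 80 ... FF hex range, which are not listed here, so this
sketch assumes Unicode code points instead: the Arabic-Indic
digits U+0660 to U+0669, the Arabic semicolon U+061B, and the
Arabic comma U+060C. The function names are hypothetical, and
a table lookup is used where the thesis model uses a case
statement.

```python
# Assumption: Unicode Arabic-Indic digits stand in for the BCON digit codes.
ARABIC_DIGITS = "".join(chr(0x0660 + d) for d in range(10))
DIGIT_MAP = str.maketrans(ARABIC_DIGITS, "0123456789")

# Assumption: a small constant table instead of a CASE OF statement.
SPECIAL_MAP = {"\u061b": ";", "\u060c": ","}


def latin_integer(token):
    """Map each Arabic digit of an integer token to its Latin digit."""
    return token.translate(DIGIT_MAP)


def latin_special(char):
    """Return the Latin equivalent of a special character; characters
    without an entry are copied unchanged."""
    return SPECIAL_MAP.get(char, char)


print(latin_integer("\u0662\u0662\u0664\u0665"))  # 2245
print(latin_special("\u061b"))                    # ;
```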

4.  File Closing and Ending

The last phase is to close the source file, the
dictionary, and the generated output file. This phase will
only be reached at normal program execution. The program
will terminate if there is a character code not in the
range of the Arabic alphabet defined by the bilingual
operating system. Long tokens and comments will cause
errors and should stop the translation, since translating a
comment makes no sense.

C.   DESIGN  GOALS 

The interface is supposed to generate, from any Arabic
source code, a Latin code in PASCAL syntax. The Arabic
programmer must master PASCAL programming in his native
language. Essentially little syntax checking and no
semantic checking will be performed on the source code. The
compiler's job is to scan the translated code and perform
the syntax and semantic checks on it. Some help for tracing
and debugging should be incorporated into such an
interface. The compiler gives the error messages in Latin.
This could be utilized in several ways. One way is to keep
the line numbers of the source code and the generated code
as close as possible. The error messages usually are stored
in a text file and can be translated. This, along with the
line number of the error location, can be combined to give
the location and type of the source code error.

A second way, if the error messages cannot be translated
in their file, is to translate the error messages and return
them with the error number. The Arabic programmer can
look up the error number in Latin and the line number of the
error, then look up the translation of the error and
explanations. In both ways a few hints regarding the errors
and possible causes should be provided to the user.

D.   DESIGN  LIMITATIONS 

The design does not use or handle diacritics at all as
far as reserved words are concerned. This could cause errors
and personal interpretations of how a reserved word is
written. Since most reserved words are clear once read, the
user must not type any vowels with the reserved words in the
program. Similarly, to avoid duplicate translations of a
single user-defined identifier, and to eliminate the
complication of debugging such cases, the user should not
use vowels in his defined identifiers. The diacritics may be
used in literal strings and headings of reports. Several
factors may affect and prevent the use of diacritics. First,
some sorting routines sort independently of diacritics,
since vowelization can upset the sorting order and the rules
for sorting the same name with different vowelization. A
second reason is that the location of the vowelization of
the character is not standardized. A third reason is that
the resolution of terminals is poor, and it is hard on the
eye to distinguish, for example, between the "FAT'HA" and
the single quote symbol in printed or displayed form.

The design therefore will not handle vowels in the
Arabic source code. However, it should be noted that the
option of including the diacritics requires few changes in
the design, and a lot of attention from the Arabic
programmer. The attention is required to rewrite his own
sorting routine that sets the ARCH value for the vowelized
source code. Also the programmer must be consistent with
his use of vowels with identifiers for the above reasons.

The display and print justifications cannot be
controlled easily within the program since the bilingual
operating system does not use a standard unified code for
Arabic display and print mode. For example, in BCON, the
operating system used for the implementation of this thesis,
if you are editing in Arabic screen mode then the cursor
will start at the far right of the screen throughout the
code. This right justification is for the Arabic format and
indentation in Arabic texts. Therefore, when you exit the
editor you must set the screen mode to Latin screen mode,
otherwise the "C:>" prompt will be displayed at the far
right of the screen. So for the sake of simplicity to the
user and consistency of the generated code, the display
codes are left out of the translator's control and are
under the control of the display system of the bilingual
operating system. The modes can be set with an external
escape code to the printer or a sequence of key strokes to
set the screen to Arabic mode.

These limitations can be resolved once there is a
standard set. I believe the bilingual operating system
should by default handle the justification issue, and allow
the user to turn this option off. This is in the range of
two to five years to come in the industry involved with
Arabic text handling.

VI.  PROGRAM MODEL

A.  INTRODUCTION 

The Lexical Translator program is intended to be simple,
flexible, and to demonstrate feasibility of the concept.
Speed and efficiency were not primary goals. Features can
be added as needed based on the response of users of the
program.

The  program  will  require  the  supervision  of  a  good 
PASCAL  programmer  to  assist  the  compilation  and  execution  of 
the  translated  code.  The  assistance  could  be  achieved  by 
simple  detailed  instructions  on  how  to  use  the  program  to 
generate  output  code. 

B.  PROGRAM  ENVIRONMENT 

The  Translator  is  developed  under  a  certain  environment, 
and  until  there  is  a  unified  standard  for  a  bilingual 
operating  system,  program  portability  and  compatibility  will 
be  limited. 

1.   Hardware  Environment 

The program is developed using an IBM XT personal
computer. It can be just as well developed using an IBM
PCjr or IBM AT. The IBM XT has 640 kilobytes of RAM, a
20 megabyte hard disk, two half-height floppy disk drives,
and the ALIS Inc. graphics board. The board is made up of
two boards back to back. The first board is a Paradise
color graphics board. The second board is on top of the
Paradise board and has the Arabic character generator and
the necessary connection circuitry. The two boards fit in
one slot on the mother board of the XT computer.

The keyboard is an IBM PC keyboard with cap stickers
for the keys. Each sticker has two to four different
characters, for Arabic and Latin. The keyboard layout is
displayed in Appendix D.

An Epson FX 85 dot matrix printer is used for the
listing of the program. The printer has an Arabic driver to
display Arabic characters.

2.  Software Environment

The ALIS Inc. BCON bilingual operating system was used
in developing the thesis program and test runs. BCON
resides in low memory using about 20K bytes. BCON is
supposed to be transparent to the DOS operating system. DOS
stands for Disk Operating System, used by IBM
microcomputers. The BCON operating system requires special
skill and more than average user knowledge. BCON is mainly
required for generating the Arabic fonts, and interpreting
and mapping the key strokes to their associated ARCH
values. The interpretation and mapping are performed under
the Arabic mode only. The Arabic characters are stored as
hex values ranging from 80 hex up to FF hex. This range of
values is reserved for graphics under the DOS operating
system. This means any Arabic character code is considered
a graphic character in the absence of BCON.

An important concept must be pointed out. The purpose
of BCON is to display the right form, font, and indentation
of Arabic text. So with minimum skill, a programmer can
develop, review, and correct Arabic characters on any DOS
compatible machine. Then the result can be displayed under
BCON, where BCON can interpret the graphics characters as
ARCH code, and display the correct textual form of the ARCH
code by sending the appropriate display code to the
terminal or the printer.

When writing long Arabic texts, it is much easier to
do so under BCON, with the aid of an Arabic word processor.
The simple EDLIN editor available on the DOS distribution
disk, or the Turbo PASCAL editor of version 2.1 and below,
will also work. There is some limitation to what one can
use under BCON and still display Arabic characters. BCON
requires two conditions for compatibility when using any
application. First, BIOS interrupts^1 16 Hex and 10 Hex are
called to access the keyboard and the screen respectively.
Second, the application must handle 8-bit characters.
[Ref. 2: p. 3-1]

Turbo PASCAL version 2.1 was used to write the main
program and resource file. The printer interface, called
MPD by ALIS [Ref. 2], is implemented for several printers.
The name stands for Multi Printer Driver. The MPD was used
to drive the Epson FX 85 to display the Arabic characters in
the program listings and sample tests (Appendices H, I).

^1 Information about the interrupts can be found in DOS
technical manuals for personal computers.

C.   PROGRAM  BODY 

The Lexical translator is designed to be easily
modified, and this should be done when the updated version
of BCON utilizing the unified standard code set is
available. The program is modular and could be rewritten in
"C" or FORTRAN. The program is designed to generate a
correct output file from a correct input source file. The
program will not interpret the result, and the programmer
must exercise creativity and care as his/her programming
advances, to assure correct results and clear output.

The printable output of any developed program is either
a string of characters or mathematical results. Since
string assignments are not altered, string output presents
no difficulties. If the result is a real or integer number,
the result will be displayed based on the BCON digit mode.
The program does not concern itself with numerals since all
the users are familiar with the Western Hindu numerals
(Latin). Also, BCON has an option that allows the user to
swap the digits in the operating system environment. So for
BCON, analyzing the results of numeric calculation would
duplicate the same work. This may be a limitation under an
operating system other than BCON.

1.  Program Files

The program has two main files that are used for the
generation of the output code: the main file and the
resource file. The main file contains constant
declarations, data structure declarations, variable
declarations, procedures and functions, and the main
program body.

The resource file has the assignments of a constant
array declared in the main program and is used as an include
file. The resource file has a subset of the reserved words
and standard function names. The resource file is a very
useful modular concept since you can replace the PASCAL
resource file with one for the language "C". With minimum
changes in the constants and directives one could use one
Translator with several resource files, one for each
language, to lexically translate from Arabic to the syntax
of one of many Latin compilers. The focus of this program
is on the Turbo PASCAL syntax.

2.  Generated Files

The translator will generate two files:

-  A Dictionary file with the same name and "DIC"
   extension.

-  An Output file with the same file name and "PAS"
   extension.

The program will generate the desired output in the "PAS"
file. The dictionary file will be updated each time an
identifier is encountered for the first time. User-defined
Arabic identifiers are translated to identifiers of the form
"id_000 ... id_999."

3.  Key Variables and Data Structure Declarations

The external file "Resource.Pas" is an assignment of
a constant array. Each element of the array is a record.
The record has two components. The first component is the
Latin reserved word or function name, and the second
component is the Arabic translated (matching) word.^2
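
A rough Python sketch of such a table and its lookup is given
below. The actual table lives in the Turbo PASCAL resource
file and its contents are not reproduced in this section, so
the three Arabic words here are placeholders chosen for
illustration, and the names RESERVED and reserved_index are
hypothetical.

```python
# Each record pairs a Latin reserved word with an Arabic word.
# Assumption: these Arabic words are illustrative placeholders only.
RESERVED = [
    ("program", "\u0628\u0631\u0646\u0627\u0645\u062c"),
    ("begin", "\u0627\u0628\u062f\u0623"),
    ("end", "\u0646\u0647\u0627\u064a\u0629"),
]


def reserved_index(arabic_token):
    """Return the index of the record matching the Arabic token,
    or -1 if the token is not a reserved word."""
    for index, (_latin, arabic) in enumerate(RESERVED):
        if arabic == arabic_token:
            return index
    return -1


index = reserved_index("\u0646\u0647\u0627\u064a\u0629")
print(index, RESERVED[index][0])  # 2 end
```

Returning the index rather than the word mirrors the match
index that is passed back to the main program and used later
to write out the Latin reserved word.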

The user-defined identifiers are handled by a
hashing scheme and a symbol table. The decision was to
demonstrate an efficient way to store and retrieve
identifiers. The lexical translator will be constantly
looking up any non-reserved identifier in a symbol table to
insert it or to get its Latin match if predefined. To
improve efficiency, the program uses a direct chain Hashing
scheme [Ref. 3:p. 45].

^2 The translation is in no way a standard, nor is it
professionally translated. The translation was made for
demonstration purposes.

The identifier is passed to Function KEY, which
calculates its key number with a hashing formula. The key
number is a location in the Hash table. The content of this
specific location is a pointer to a word record which
either contains the word or is where a new record should be
inserted in case the word was not found. Words having the
same key number will be linked together in a linked list.
Having several words with the same key number decreases the
efficiency of Hashing (see Ref. 3 on how to avoid Hashing
collisions and when to use Hashing). The word record has
the following:

Id_No       -    the identifier number in the sequence of
                 insertion.

Length      -    number of characters of the identifier.

Lastchar    -    location of the last character in the
                 symbol table.

Nextword    -    pointer to the next identifier with the
                 same key number.

Latin_Id    -    the Latin identifier assigned to the
                 identifier.

With the above word (identifier) information, we can locate
the word in the symbol table. The spelling table is
declared as an array of 5000 characters. The size is an
estimate and can be changed when a closer estimate can be
made. The symbol table is implemented as a linked list
and its size can vary dynamically so as to be as large as
necessary.
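
The direct chaining scheme described above can be sketched
in Python as follows. Several details are assumptions made
for illustration and are not taken from the thesis program:
the table size of 97, the sum-of-codes hashing formula, the
numbering from zero, and storing the spelling directly in the
record instead of in a separate spelling table located
through Length and Lastchar.

```python
TABLE_SIZE = 97  # assumption: any modest prime would do for this sketch


def hash_key(word):
    """Assumed hashing formula: sum of character codes modulo table size."""
    return sum(ord(c) for c in word) % TABLE_SIZE


class WordRecord:
    def __init__(self, id_no, word, next_word):
        self.id_no = id_no                    # sequence of insertion
        self.word = word                      # spelling kept inline here
        self.latin_id = "id_%03d" % id_no     # Latin identifier
        self.next_word = next_word            # next record with the same key


class SymbolTable:
    def __init__(self):
        # All hash table pointers start out as nil (None).
        self.buckets = [None] * TABLE_SIZE
        self.count = 0

    def lookup_or_insert(self, word):
        key = hash_key(word)
        record = self.buckets[key]
        while record is not None:             # walk the chain for this key
            if record.word == word:
                return record
            record = record.next_word
        # Not found: enter the new record at the beginning of the chain.
        record = WordRecord(self.count, word, self.buckets[key])
        self.buckets[key] = record
        self.count += 1
        return record


table = SymbolTable()
print(table.lookup_or_insert("ab").latin_id)  # id_000
print(table.lookup_or_insert("ba").latin_id)  # id_001 (same key as "ab")
print(table.lookup_or_insert("ab").latin_id)  # id_000 again
```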

The translator looks for tokens using two methods.
The first method uses a pair of delimiters to identify the
token. The pair defines the beginning and end of a token.
Token classes that can be identified by this method are
comments, literal strings, and directives.

The second method recognizes a token by its first
character. Examples of this class are integers and
identifiers. The second method includes tokens with one
character such as separators and terminators. Both
separators and terminators will be referred to as
delimiters throughout the program. The delimiters are
defined in a constant set. The Hex values of the set can be
interpreted with the aid of the ARCH table (Appendix D).

Errors are a user-defined data type. Types of
errors are, for example, long_token, long_comment, and
long_literal_string. All of the above errors are expected
to occur as a result of failure of the programmer to end a
comment or literal string.

The  token  types  are  defined  to  be  one  of  the 
following: 

-  Blanks 

-  Reserved_word 

-  Identifier 

-  Literal_String 

-  Control_Code 

-  Comment 

-  Integer 

-  Functional_Operator 

-  Unclassified 

-  Illegal 

These are the main declarations of the program. The
definition of the tokens and assignments of the variables
will be covered in the following sections.

4.  Token Classes I and II

Class I tokens are recognized using the first
method. This includes the following types of tokens:

Literal_String:  This token begins with an Arabic quote
                 mark, single or double, and ends with it.
                 The Hex values are 97 Hex and A2 Hex.

Comments:        Begin with a right bracket followed by an
                 asterisk and end with an asterisk
                 followed by a left bracket.

Directives:      Are strings in curly brackets. This
                 feature is for debugging. The directives
                 will allow the user to choose between
                 commented Latin source, with original
                 comments, and a debugging option to
                 display on the monitor the tokens and
                 their types.
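
Recognition by a delimiter pair can be sketched in Python as
below. The names scan_delimited and LongTokenError are
hypothetical, and Latin delimiters stand in for the Arabic
codes. A token whose closing delimiter is missing from the
line is reported as an error, in the spirit of the long_token
family of errors.

```python
class LongTokenError(Exception):
    """Raised when a delimited token is not closed on the same line."""


def scan_delimited(line, start, opener, closer):
    """Return (token, next_loc) for a token that begins with `opener`
    at line[start] and ends with the first `closer` after it."""
    if not line.startswith(opener, start):
        raise ValueError("token does not begin with the opening delimiter")
    # Search for the closer only after the whole opener has been passed.
    end = line.find(closer, start + len(opener))
    if end == -1:
        raise LongTokenError("no closing delimiter on this line")
    next_loc = end + len(closer)
    return line[start:next_loc], next_loc


print(scan_delimited("x := 1; (* note *) y", 8, "(*", "*)"))
print(scan_delimited("'abc' rest", 0, "'", "'"))
```

The same routine serves comments, literal strings, and
directives; only the delimiter pair changes.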

Class II covers the identifiers, including reserved
words, and integers. The remainder of token types will be
reviewed shortly.

Identifiers and Reserved Words:
                 Begin with an Arabic letter, optionally
                 followed by any number of underscores,
                 digits, or other Arabic characters.

Integers:        Begin with a digit and end with any
                 non-digit character.

The remainder of the token types are Functional_Operator,
Illegal, and Unclassified. Functional_Operator tokens are
the arithmetic operators, brackets, asterisk, decimal digit,
semicolon, colon, pointer '^', etc. The illegal token is
the token that exceeds its defined length. This condition
is used to set an error message to pass to the user about
the location of an error. An Illegal token is also set if
the Hex code is less than 80 Hex. The legal range is 80 ...
FF Hex. The control code is any escape code or function
call within the range of Arabic characters, from 80 Hex ...
FF Hex. The Unclassified token type is used as the value
before the type is determined.

D.   PROGRAM  MODULES 

The Lexical translator calls several procedures and
functions in the process of generating the desired code.
The program modules and their locally declared procedures
and functions are as follows:

Open_File
Initialize
Fill_Buffer
Token_And_Type
Blank
Comment
Literal_String
Integer_Token
Identifier_Token
Reserved_Token
Special_Char_Token
Control_Char_Token
Map_Identifier_To_Latin
Search
Hash_Key
Insert:   calls Id_No
Found
Latin_Integer
Get_Latin_Spec_Char
Print_Error_Messages

1.  Open_File 

The program starts by calling the Open_File
procedure. The procedure will prompt the user for the name
of a file to translate and verify that the file does exist.
The second part is to open the input file for reading, and
the Output and Dictionary files for writing.

2.  Initialize

Initialize  procedure  will  set  all  the  hash  table 
pointers  to  nil.  The  nil  values  are  used  to  indicate  that 
there  are  no  words  with  that  key  number  yet  initialized.  It 
will  also  set  the  initial  values  of  global  variables.  The 
module  is  called  once  at  the  beginning  of  the  program. 

3.  Fill_Buffer

This  procedure  will  get  a  line  of  source  code,  keep 
track  of  the  line  number  of  the  source  code,  and  set  the 
line  size  of  the  source  code.  This  module  is  continuously 
called  by  the  main  program  until  the  end  of  the  source  file 
is  reached. 

4.  Buffer_Empty

This function tests whether the variable Next_Loc,
which represents the next token location on the line, is
pointing beyond the Line_Size variable. In this case the
function returns true, causing the main program to call
the Fill_Buffer procedure to refill the buffer. This module
is called continuously by the main program.
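
The interplay of Fill_Buffer and Buffer_Empty can be sketched
in Python as below. The class LineBuffer and the helper
read_all are hypothetical names. Python strings are indexed
from zero, so "pointing beyond the line size" becomes
next_loc >= len(line) here, whereas a PASCAL string would be
indexed from one.

```python
import io


class LineBuffer:
    def __init__(self, source):
        self.source = source   # any object with readline()
        self.line = ""
        self.line_no = 0       # line number in the source code
        self.next_loc = 0      # location of the next token on the line

    def fill_buffer(self):
        """Read the next source line; return False at end of file."""
        text = self.source.readline()
        if text == "":
            return False
        self.line = text.rstrip("\r\n")
        self.line_no += 1
        self.next_loc = 0
        return True

    def buffer_empty(self):
        """True when next_loc points beyond the end of the line."""
        return self.next_loc >= len(self.line)


def read_all(source):
    """Visit every character, refilling the buffer whenever it is empty."""
    buf = LineBuffer(source)
    seen = []
    while buf.fill_buffer():
        while not buf.buffer_empty():
            seen.append((buf.line_no, buf.line[buf.next_loc]))
            buf.next_loc += 1
    return seen


print(read_all(io.StringIO("ab\n\nc")))
```

In the translator the inner loop would fetch a whole token
and advance next_loc by the token length rather than by one
character.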

5.  Token_And_Type

When called, this procedure is passed a line of
source code and the location of the first character of the
token to be fetched. The procedure gets the token and gives
it a type. The procedure initially sets the type of the
token to Unclassified and, through several calls, tries to
determine the type of the token. The first convenient check
is for Comments. It should be noted here that one would
like to place the most likely type check at the beginning to
reduce the time of analysis of the token type. Another
reason for searching for comments first is that they are the
only type that requires two characters at the beginning and
the end of the token. The rest can be predicted just by
inspecting the first character.

If  the  token  type  is  not  set  to  Comment,  then  the 
module  calls  several  modules  with  a  case  statement.  The 
modules  are  called  based  on  the  first  character  after  the 
last  token  read.  The  Next  Location  variable  points  at  this 
character  in  the  input  line  buffer  called  "Line."  The 
possibilities  are: 

FIRST CHARACTER              LIKELY TOKEN TYPE

Arabic space                 Blank(s)
Double or single quotes      Literal string
Arabic digit                 Integer
Arabic letter                Identifier
Function code                Control char
Other characters             Special characters

Each possible token type above corresponds to a module, which
is called to set the type of the token.
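
The dispatch on the first character can be sketched as follows. This is a Python illustration rather than the thesis's Turbo PASCAL case statement. The code values are taken from the reduced code table in Appendix D; treating the whole range D0 to FD as letters, and the particular set of function codes, are assumptions of this sketch.

```python
ARABIC_SPACE = 0xA0
QUOTES = {0xA2, 0x84}            # Arabic quotation mark, Arabic apostrophe
FUNCTION_CODES = {0x0E, 0x0F, 0x8E, 0x8F, 0x90, 0x91, 0x9E}


def likely_token_type(first: int) -> str:
    """Predict the token type from the first character code alone."""
    if first == ARABIC_SPACE:
        return "Blank"
    if first in QUOTES:
        return "Literal string"
    if 0xB0 <= first <= 0xB9:    # Arabic digits 0..9
        return "Integer"
    if 0xD0 <= first <= 0xFD:    # assumed letter range
        return "Identifier"
    if first in FUNCTION_CODES:
        return "Control char"
    return "Special character"   # everything else
```

The order of the tests mirrors the table; only the comment check, which needs two characters, has to happen before this dispatch.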

Each module called by Token_And_Type sets the token type and
the length of the token. None of the likely token types except
Literal String and Comment sets an error flag, since a single
character satisfies their requirements. For example, Blanks,
Integers, Identifiers, Control Characters, and Special
Characters may all be one character long. Literal strings and
comments, by contrast, must begin and end with a
predetermined pattern, so a comment left open past the end of
the line is an error, and the same holds for an over-long
literal string token. Token_And_Type only examines the Line
Buffer character and does not consume it. The called modules
assign the character to the Token Buffer and advance the Line
Buffer pointer by one character. Once a token type has been
successfully assigned, the module sets the token length.
PASCAL uses the first array location (index 0 of a Turbo
PASCAL string) to store the length, in bytes, of the assigned
characters.
The behavior of the modules called by Token_And_Type is
summarized below:

Blanks:
    Keeps consuming Line Buffer blanks (Arabic and Latin)
    until a non-blank character is reached. Blanks sets the
    Token Type and Length.

Comment:
    Consumes characters within the Arabic character range
    until the comment closing mark is reached. The module adds
    long comment to the error set if any character lies in the
    Latin alphabet range, including the end of file and the
    carriage return and line feed (ASCII 0D, 0A Hex). The
    error is long comment because a comment is restricted to
    one line. Comment alters the opening and closing brackets
    of the Arabic comment token. The characters involved are
    the Arabic opening bracket, closing bracket, and asterisk,
    with the Hex values A8, A9, and AA respectively.

Literal_String:
    Called when the next character is a single or double
    quote. The module expects the token to be terminated by
    the same character it began with. If the matching
    character is not reached before the end of the line, the
    token is considered illegal and long token is added to the
    error set. Valid literal strings are not altered; however,
    the opening and closing quotes are translated to single or
    double quotes accordingly.

Integer_Tok:
    Stands for integer token; called when a digit is present.
    The module keeps assigning the Latin digits to the token
    buffer and assigns the token type Integer to the variable
    Tok_type.

Identifier_Tok:
    Called when the character is a letter. A single letter
    qualifies as an identifier by itself, or it may be
    followed by any number of Arabic underscores, digits, or
    letters. The module sets Tok_type to Identifier. It has no
    effect on the error set, since the token was already valid
    on the basis of its first character.

Reserved_Tok:
    Called when the token found is an identifier. The module
    checks whether the token is in the reserved-word constant
    array called "Res_Word." If the identifier is a reserved
    word, its index in the table is passed back to the main
    program.

Control_Char_Tok:
    Called when a BCON function code is the next character in
    the Line_Buffer. The module assigns one character (code)
    to the token buffer.

Special_Char:
    Assigns one character to the token. The token always has
    exactly one character.
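
The two modules that can raise errors, Comment and Literal_String, can be sketched as follows. This is a Python illustration under stated assumptions, not the thesis's code: it assumes a comment opens with the bytes A8 AA and closes with AA A9, that the quote codes are A2 and 84 (from Appendix D), and that the error set is a plain set of strings. The function names and return shape are hypothetical.

```python
OPEN_PAREN, CLOSE_PAREN, ASTERISK = 0xA8, 0xA9, 0xAA
QUOTE_MAP = {0xA2: b'"', 0x84: b"'"}   # Arabic quotation mark, apostrophe
LONG_COMMENT, LONG_TOKEN = "long comment", "long token"


def scan_comment(line: bytes, start: int):
    """Return (token, next_loc, errors) for a comment at line[start]."""
    if line[start:start + 2] != bytes([OPEN_PAREN, ASTERISK]):
        return None, start, set()              # not a comment at all
    token = bytearray(b"(*")                   # Latin opening mark
    pos = start + 2
    while pos < len(line):
        if line[pos:pos + 2] == bytes([ASTERISK, CLOSE_PAREN]):
            token += b"*)"                     # Latin closing mark
            return bytes(token), pos + 2, set()
        if line[pos] < 0x80:                   # Latin range, incl. CR and LF
            break
        token.append(line[pos])
        pos += 1
    return bytes(token), pos, {LONG_COMMENT}


def scan_literal(line: bytes, start: int):
    """Return (token, next_loc, errors) for a literal string at line[start]."""
    opener = line[start]
    latin_quote = QUOTE_MAP[opener]
    token = bytearray(latin_quote)
    pos = start + 1
    while pos < len(line):
        if line[pos] == opener:                # must end as it began
            token += latin_quote
            return bytes(token), pos + 1, set()
        if line[pos] in (0x0D, 0x0A):          # end of line reached first
            break
        token.append(line[pos])                # contents are not altered
        pos += 1
    return bytes(token), pos, {LONG_TOKEN}
```

In both functions only the delimiters are translated; the characters between them are copied unchanged, which matches the statement that valid comments and literal strings are not altered.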

When Token_And_Type returns the token type to the main
program, a case statement either calls a procedure or does
the processing with a compound statement. Blanks are
translated to Latin ASCII blanks. A returned comment token is
written out as is. Literal strings are written out literally.
Reserved words are written out using Match_Index, their index
in the Res_Word constant array. Identifiers are looked up in
the symbol table. If the identifier is already defined, its
identifier number is returned; otherwise the identifier is
inserted in the table and given an identifier number. The
module used is called Map_Iden_To_Latin.

6.  Map_Iden_To_Latin

The Identifier token is received and looked up with a
procedure called Search.

7.  Search

This module starts by calling the Hash_Key function.

a.  Hash_Key

Hash_Key calculates the token's key_no with a hash formula.
The key number is used to look up the pointer to the word
record in the hash table. The word records form a linked list
of identifiers that share the same key number. All the
pointers are initialized to nil at the beginning of the
program. A nil pointer for the key number means that neither
this word nor any other word with the same hash key number is
in the symbol table; Search then calls Insert to insert the
identifier in the symbol table.

b.  Insert

Insert creates a word record at the beginning of the linked
list and stores the identifier in the spelling table. Insert
calls IDEN_LBL_NO, which uses the global variable ID_NO
(sequence of appearance) and assigns an identifier number in
the word record.

If the pointer points at a word record, the first word in the
linked list is checked, then the next, and so on until the
word is found or there are no more word records in the list.

c.  Found

The function Found checks whether the resulting pointer
points at the exact identifier spelling. If the word record
is found, it has already been assigned an identifier number,
which is passed back to the main program to be written out as
the Latin identifier.
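
The interplay of Search, Hash_Key, Insert, and Found can be sketched as follows. This is a Python illustration, not the thesis's Turbo PASCAL code. The hash formula is not given in this section, so the sum of byte values modulo the table size is an assumption, as are the table size of 211 and the class and method names.

```python
class WordRecord:
    """One node in the linked list of identifiers sharing a key number."""

    def __init__(self, spelling, id_no, next_record):
        self.spelling = spelling
        self.id_no = id_no
        self.next = next_record


class SymbolTable:
    def __init__(self, table_size=211):
        self.buckets = [None] * table_size   # every pointer starts as nil
        self.id_no = 0                       # sequence of appearance

    def hash_key(self, spelling):
        # Assumed hash formula: sum of byte values modulo the table size.
        return sum(spelling) % len(self.buckets)

    @staticmethod
    def found(record, spelling):
        """True when the record holds exactly this identifier spelling."""
        return record.spelling == spelling

    def insert(self, key_no, spelling):
        """Create a word record at the head of the list for key_no."""
        self.id_no += 1
        self.buckets[key_no] = WordRecord(spelling, self.id_no,
                                          self.buckets[key_no])
        return self.id_no

    def search(self, spelling):
        """Return the identifier number, inserting the word if it is new."""
        key_no = self.hash_key(spelling)
        record = self.buckets[key_no]
        while record is not None:            # walk the linked list
            if self.found(record, spelling):
                return record.id_no
            record = record.next
        return self.insert(key_no, spelling)
```

Two identifiers made of the same letters in a different order collide under this assumed formula, which exercises the linked-list walk: each still receives its own identifier number.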
8.  Latin_Int

The procedure maps each digit of the token to the
corresponding Latin digit 0...9 and passes back the Latin
integer.
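
A minimal Python sketch of this digit mapping follows. It assumes the Arabic digits occupy the codes B0 to B9 Hex, as listed in Appendix D; the function name latin_int is a hypothetical stand-in for the PASCAL procedure.

```python
def latin_int(token: bytes) -> str:
    """Map each Arabic digit code (B0..B9 Hex) to the Latin digit 0..9."""
    digits = []
    for code in token:
        if not 0xB0 <= code <= 0xB9:
            raise ValueError(f"not an Arabic digit: {code:02X}")
        digits.append(chr(ord("0") + code - 0xB0))
    return "".join(digits)
```

Since the Arabic digits are laid out in the same order as the ASCII digits, the mapping reduces to subtracting a fixed offset from each code.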

9.  Get_Latin_Spec_Char

The procedure gives each Arabic special character its
"functionally" equivalent Latin character.
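
Such a mapping can be sketched as a lookup table. The Python below is an illustration only: the pairs are inferred from the character names in the Appendix D reduced code table, the subset shown is not the thesis's complete table, and pairing each code with the Latin character of the same name is an assumption.

```python
# Assumed pairs, derived from the Appendix D character names.
LATIN_SPEC_CHAR = {
    0x81: "#", 0x92: "@", 0x93: "[", 0x94: "]", 0x95: "^", 0x96: "_",
    0x98: "{", 0x9A: "}",
    0xA1: "!", 0xA4: "$", 0xA6: ".", 0xA8: "(", 0xA9: ")", 0xAA: "*",
    0xAB: "+", 0xAC: ",", 0xAD: "-", 0xAF: "/",
    0xBA: ":", 0xBB: ";", 0xBC: ">", 0xBD: "=", 0xBE: "<",
}


def get_latin_spec_char(code: int) -> str:
    """Return the Latin character assumed equivalent to an Arabic code."""
    try:
        return LATIN_SPEC_CHAR[code]
    except KeyError:
        raise ValueError(f"no Latin equivalent assumed for {code:02X}") from None
```

For example, the Arabic colon followed by the Arabic equals sign (BA, BD) maps to the PASCAL assignment operator `:=`.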

10.  Print_Errors

Based on the error set, Print_Errors reports the error type
and the line number in the source code where the error was
encountered.

E.   PROGRAM  DIRECTIVES 

The program offers two directives. The first is the option to
keep the source comments in the output file; by default the
program omits them. The second turns the debug option on or
off at any location in the code, at the beginning of a line.
This option displays the tokens and their types as they are
scanned.

The program is demonstrated by a list of test runs that
verify the translation of reserved words and special
characters. A sample of small PASCAL programs is also
included, with their generated files, code, and dictionary
tables (Appendix I).

F.   LIMITATIONS

The program does not allow the user to use the 'Include'
directive of TURBO PASCAL. TURBO PASCAL limits the size of a
program to 64k, beyond which additional code would have to be
brought in as an 'Include' file.

The program is set to handle up to one thousand identifiers.
This is a reasonable number when working with TURBO PASCAL,
since the program size is limited to 64k bytes.

The spelling table is 5000 characters long, which means the
total length of all identifiers cannot exceed 5000
characters. When writing long programs, the programmer can
avoid exceeding the limit by using short identifiers.
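
The two capacity limits can be made concrete with a short Python sketch. Only the limits of 1000 identifiers and 5000 characters come from the text; the class SpellingTable, its layout as one character array plus (start, length) entries, and the use of OverflowError are assumptions of this illustration.

```python
MAX_IDENTIFIERS = 1000        # limit stated in the text
SPELLING_TABLE_SIZE = 5000    # limit stated in the text


class SpellingTable:
    """All identifier spellings packed end to end in one character array."""

    def __init__(self):
        self.chars = bytearray()
        self.entries = []                     # (start, length) per identifier

    def add(self, spelling: bytes) -> int:
        """Store a spelling and return its 1-based identifier number."""
        if len(self.entries) >= MAX_IDENTIFIERS:
            raise OverflowError("too many identifiers")
        if len(self.chars) + len(spelling) > SPELLING_TABLE_SIZE:
            raise OverflowError("spelling table full")
        self.entries.append((len(self.chars), len(spelling)))
        self.chars += spelling
        return len(self.entries)

    def spelling(self, id_no: int) -> bytes:
        """Recover the spelling stored for an identifier number."""
        start, length = self.entries[id_no - 1]
        return bytes(self.chars[start:start + length])
```

With an average identifier length of five characters both limits are reached together, which is why short identifiers help in long programs.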

The  program  will  not  generate  an  error  flag  if  a  Latin 
string  is  found  in  comments  or  literal  string.  This  is 
because  both  comments  and  literal  strings  are  not  altered. 

ARCH provides two commas. The numeric comma is used with real
numbers in Arabic, and the Arabic comma is used, in this
specification, as the Latin comma in every case except real
numbers. This is a small hurdle when translating the
generated code back to Arabic. The two Arabic commas differ
in appearance: they are 180° out of phase on the vertical
axis, and the numeric comma looks like the Latin comma. Both
commas are used in order to avoid overloading the Arabic
comma.
VII.   CONCLUSION

This thesis has tried to narrow the gap between educated
Arabic-speaking people and computers in general. The target
age groups are the mid-teens and forty-five and above. The
majority of these two groups still look at computers as
magic. They believe man created them; however, they have a
hard time believing that man tells computers what to do.
With that attitude, the only thing that can convince them is
helping them to write small programs and see the results. We
are convinced that the majority will then lose their fear and
want to explore this machine.

In short, the interface between the rich Latin software
library and the Arabic language environment is a promising
area: it will bring those who fear computers closer to them
and help them find a more efficient way to get a job or hobby
done.

A.   CONCEPT  FUTURE 

The program is simple in concept and in coding, but the
environment in which it is expected to work is not yet
standardized. The standards are not widely implemented, nor
are the developers of bilingual operating systems very
helpful in responding to concerns about hardware
compatibility.

Once a unified environment is established, the concept can be
developed further. The goal of this work was to illustrate
feasibility and to avoid issues specific to the
implementation environment. The program modules were designed
to be adaptable and portable for several purposes with little
modification. For example:

-  For translating several programming languages, such as
"C," FORTRAN, and BASIC, we only need several resource
files and several special character sets, one for each
programming language.

-  For several code sets, including different languages, we
need the concept of a bilingual operating system that
uses the upper range of the character set, 80 ... FF Hex.

-  The program can work in a Latin-only operating system to
translate source code that has been edited using Arabic
code set values. The generated source can also be
compiled on the same machine. If the generated program is
interactive, then it needs to run under a bilingual
operating system.

B.   LIMITATIONS 

The bilingual operating system was not well documented with
respect to how some of the function codes are implemented
during editing. Some characters have two codes (such as the
Arabic multiply sign and the numeric multiply sign). To find
out which multiply sign a keystroke generated, I had to use
an editing tool to display the codes in Hex and match the
text file against its Hex values.

Right indentation is relative to the editor mode. If the
screen mode is set to Arabic and a piece of Latin code is
read, it will be right justified.

The user must be careful when reading data files. Some data
is readable only in Arabic mode and some only in Latin mode.
The data displayed may also have been transformed by the
operating system. As mentioned before, the user can use the
"SWAP" option to alternate between ASCII digits and ARCH
digits in the DOS environment, or read the digits as a string
and convert the values to ASCII. This is necessary in order
to perform numerical operations on Arabic digits.
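
The second option, reading the digits as a string and converting them, can be sketched in Python as follows. The sketch assumes the ARCH digits occupy B0 to B9 Hex as listed in Appendix D; the function name arch_digits_to_int is hypothetical, and this is not code from the thesis.

```python
# Translation table from ARCH digit codes (B0..B9 Hex) to ASCII digits.
ARCH_TO_ASCII = bytes.maketrans(bytes(range(0xB0, 0xBA)), b"0123456789")


def arch_digits_to_int(text: bytes) -> int:
    """Convert a digit string read from a data file into a number."""
    ascii_text = text.translate(ARCH_TO_ASCII)   # ASCII digits pass through
    return int(ascii_text.decode("ascii"))       # fails on any non-digit byte
```

Once the value is an ordinary integer, arithmetic works as usual, for example `arch_digits_to_int(bytes([0xB4, 0xB2])) + 1`.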

I  strongly  believe  that,  with  time,  standards  will  be 
developed  with  more  care  and  concern  for  the  user.  This  is 
the  reason  we  chose  not  to  design  the  program  for  a  specific 
system. 

It  is  hoped  that  this  work  will  benefit  other 
researchers  and  future  thesis  students  from  other  countries 
since  a  similar  concept  could  be  applied  to  other  languages, 
especially  languages  descended  from  Latin. 

APPENDIX A
FIGURES

Figure 1.   The 28 Arabic Alphabets

Figure 2.   The 31 Alphabets (Optimum Set)

Figure 3.   Arabic Alphabet Names
(labels: KAF, THAL, LAM, RA, MEEM, ZA, NOON, SEEN, HA, SHEEN,
WAW, SAD, YA, HAMZAH, TAA MARBOTA, ALEF MAQSURA)

Figure 4.   Arabic Diacritics (Vowelization)

Figure 5.   Hindu Numerals
(panels: Eastern Hindu numerals, Western Hindu numerals)

Figure 6.   Arabic Text
(panels: a. Without Vowels, b. With Vowels)

Figure 7.   BCON Code Sets
(labels: Keyboard, key codes, reduced codes, display codes,
Operating system and applications)

APPENDIX B

TEXAS  INSTRUMENTS  APPROACH  TO 
BILINGUAL  OPERATING  SYSTEM 


Philosophy  of  Bilingual  Arabic 
Latin  Implementation  on  Microcomputer 

System 


Texas  Instruments 

ARABIC COMPUTER SYSTEMS
PHILOSOPHY 


SPECIFIC  CHARACTERISTICS  OF  THE  ARABIC  LANGUAGE 


•  ARABIC  IS  WRITTEN  FROM  RIGHT  TO  LEFT 

•  THERE  ARE  SOME  VARIATIONS  IN  TYPES  OF  ARABIC  CURRENTLY  IN  USE  IN 
DIFFERENT  COUNTRIES 

•  THE  LANGUAGE  IS  A  FOUR  LEVEL  ONE.  A  CHARACTER  CAN  HAVE  UP  TO 
FOUR  SHAPES  DEPENDING  ON  ITS  POSITION  IN  THE  WORD  :  ISOLATED, 
INITIAL,  MEDIAL  OR  FINAL 

•  ARABIC  CHARACTERS  ARE  JOINED  WITHIN  A  WORD 

•  NO  UPPER  CASE  EXISTS  IN  ARABIC 

I'll start my presentation by briefly mentioning some of the characteristics of the
Arabic language which have been covered in previous papers and which affect the use of
the Arabic language in the computer field.

ARABIC COMPUTER SYSTEMS
PHILOSOPHY 


SPECIFIC  CHARACTERISTICS  OF  THE  ARABIC  LANGUAGE 


ONLY THREE CHARACTER VOWELS EXIST IN ARABIC :
ALIF, OUAOU, YAA.

VOWELISATION  IN  ARABIC  IS  ALSO  PERFORMED  THROUGH  THE  USE  OF 
DIACRITICS.  THESE  ARE  USED  : 

-  IN  THE  CASE  OF  SIMILARLY  WRITTEN  WORDS  TO  AID  THE  READER 

-  IN  RELIGIOUS  TEXTS  INCLUDING  THE  KORAN 

-  FOR  SCHOOL  TEACHING 

ARABIC  LANGUAGE  USES  INDIAN  NUMERICS,  WITH  THE  DECIMAL  POINT 
BEING  A  COMMA. 

THERE ARE ARABIC SPECIAL CHARACTERS WHICH INCLUDE THE ARABIC
COMMA, SEMICOLON, QUESTION MARK, ETC.

ARABIC COMPUTER SYSTEMS
PHILOSOPHY 


ARABIC  ALPHABET 

•  THE  BASIC  ARABIC  ALPHABET  IS  COMPOSED  OF  28  CHARACTERS 

•  THE  LAMALIF  WHICH  IS  COMPOSED  OF  TWO  CHARACTERS      LAM  +  ALIF 
IS  CONSIDERED  AS  ONE  CHARACTER 

•  THE  HAMZA  CAN  BE  WRITTEN  IN  MANY  DIFFERENT  WAYS  IN  ARABIC 
DEPENDING  ON  ITS  USE,  WITH  A  VOWEL  OR  ISOLATED 

•  IF  THESE  TWO  CHARACTERS  ARE  TAKEN  INTO  CONSIDERATION  THE 
ALPHABET  IS  30  CHARACTERS 

•  THE  TAMARBOUTA  IS  A  SPECIAL  CHARACTER  NOT  INCLUDED  IN  THE 
ALPHABET.  IT  IS  OCCASIONALLY  INCLUDED  AT  THE  END  OF  WORDS 
DEPENDING  ON  GRAMMATICAL  RULES 

ARABIC COMPUTER SYSTEMS

BILINGUAL  SYSTEM  APPROACH 


SOLUTION 1 : CORRESPONDENCE & DIFFERENCES

THIS STUDY IS BASED ON THE CORRESPONDENCE AND DIFFERENCES
BETWEEN ARABIC CHARACTERS. THE ARABIC ALPHABET MAY BE CONSIDERED
AS FORMED OF THREE TYPES OF CHARACTERS :

-  TYPE A INCLUDES CHARACTERS HAVING 1, 2, OR 3 POINTS

-  TYPE B INCLUDES CHARACTERS WITHOUT POINTS

-  TYPE C INCLUDES CHARACTERS HAVING AT LEAST ONE FORM IN EACH CASE

IF WE ONLY CONSIDER THE FORMS WITHOUT POINTS WE CAN REDUCE THE
CHARACTERS IN EACH TYPE AND THEN ADD THE POINTS AFTERWARDS

ARABIC COMPUTER SYSTEMS
BILINGUAL  SYSTEM  APPROACH 

SOLUTION  2  :  ROOTS  &  APPENDICES 


A STUDY BASED ON THE USE OF APPENDICES AND ROOTS TENDS TO
REDUCE THE TOTAL NUMBER OF SHAPES BY CONSIDERING A ROOT TO BE
USED IN INITIAL & MEDIAL SHAPES TO WHICH AN APPENDIX IS ADDED TO
FORM THE FINAL OR ISOLATED SHAPES

[Figure: TYPE A, TYPE B and TYPE C characters, each written as root + appendix]

The problem with this solution is what code to give to these appendices. If they are coded, would
they be considered as characters in a character count? How would high level languages interpret them?
How would special s/w functions (replace, insert, find string) interpret them?
This is the study which resulted in the actual Arabic implementation on Texas Instruments equipment
and which will be explained in this paper.

ARABIC COMPUTER SYSTEMS
BILINGUAL  SYSTEM  APPROACH 

SOLUTION  3  :  CONTEXTUAL  ANALYSIS 

A  STUDY  BASED  ON  THE  USE  OF  SHAPING  ALGORITHMS.  USING  CONTEXTUAL 
ANALYSIS  TO  DETERMINE  THE  PROPER  SHAPE  OF  THE  CHARACTER,  FOUR 
GROUPS  ARE  IDENTIFIED 

-  GROUP  1  ONE  SHAPE  PER  CHARACTER 

-  GROUP  2  TWO  SHAPES  PER  CHARACTER 

-  GROUP  3  THREE  SHAPES  PER  CHARACTER 

-  GROUP  4  FOUR  SHAPES  PER  CHARACTER 

POSSIBLE  APPROACHES 

-  ONE-KEY  ONE-SHAPE  SIMPLIFIES  THE  SOFTWARE  BUT  USUALLY  LIMITS  THE 
SET  OF  ARABIC  CHARACTERS  AND  CREATES  A  COMPLEX  KEYBOARD  SINCE 
ALL  THE  ARABIC  CHARACTER  SHAPES  MUST  BE  PRESENT  ON  THE 
KEYBOARD. 

-  ONE-KEY  MANY-SHAPES  IMPLIES  MORE  SOPHISTICATED  SOFTWARE  BUT 
SIMPLIFIES  KEYBOARD  &  USER  INTERFACE 


Of these 2 approaches the 2nd one has been chosen and this will be covered in the following slides.

APPENDIX C

DS990 BILINGUAL COMPUTER
SYSTEM  BY  TEXAS  INSTRUMENTS 

ARABIC  COMPUTER  SYSTEMS 
DS990  BILINGUAL  SYSTEM 


COMMERCIAL  COMPUTING  REQUIREMENTS  FOR  THE  MIDDLE-EAST 

•  BILINGUAL  LATIN/ARABIC  DATA  INPUT  &  OUTPUT 

•  COBOL  DRIVEN  APPLICATIONS 

•  BILINGUAL  PRINTING 

•  BILINGUAL  SORT/MERGE 

SPECIAL PRODUCTS DEVELOPED TO MEET REQUIREMENTS

•  BILINGUAL  DATA  ENTRY  TERMINAL 

•  BILINGUAL  MATRIX  PRINTER 

•  BILINGUAL  LINE  PRINTER 

•  SOFTWARE

These handle both in the natural manner + software simplified k/w for operators + high level
languages easy handling.

ARABIC COMPUTER SYSTEMS
DS990  BILINGUAL  SYSTEM 


CHARACTERISTICS  OF  BILINGUAL  DATA  ENTRY  TERMINAL 


•      BILINGUAL  VIDEO  DISPLAY  UNIT 

-      THE CHARACTER GENERATOR ROM GENERATES 7x8 MATRIX FOR ALL
STANDARD ASCII CHARACTERS AND 128 ARABIC SHAPES

-      A 7 x 10 MATRIX IS USED FOR INTRICATE ARABIC CHARACTERS


Latin & Arabic can be displayed on the screen at the same time.

ARABIC COMPUTER SYSTEMS
DS990  BILINGUAL  SYSTEM 

•      BILINGUAL  KEYBOARD 

-  PROVIDES  5  MODES  OF  OPERATION  :  ARABIC,  LATIN,  SHIFT,  UPPERCASE 
&  CONTROL.  IT  CONSISTS  OF  91  KEYS 

-  PROVIDES  THE  USER  WITH  THE  CAPABILITY  OF  ENTERING  ARABIC 
AND/OR  LATIN  DATA  WITHOUT  CONSTRAINTS 

-  KEYBOARD  MULTIFUNCTION  CAPABILITY  IS  PROVIDED  BY  A  MODE 
SELECTION  KEY  AND  TWO  CHARACTER  SET  SELECTION  KEYS 

-  DATA  IN  EITHER  LANGUAGE  CAN  BE  ENTERED  IN  EITHER  MODE 

-  THE  KEYBOARD  GENERATES  7-BIT  CODES  FOR  LATIN  AND  8-BIT  CODES 
FOR  ARABIC 


[Figure: bilingual keyboard layout]

Basic placements of Arabic keys like typewriter.

ARABIC COMPUTER SYSTEMS
DS990  BILINGUAL  SYSTEM 

ARABIC  CHARACTER  SHAPING 

•  32  BASIC  ARABIC  CHARACTERS  ARE  GENERATED  BY  THE  KEYBOARD 

•  A  CONTEXTUAL  ANALYSIS  OF  THE  ARABIC  DATA  IS  PERFORMED  BY  THE 
CONTROL  PROGRAM  TO  DETERMINE  THE  CORRECT  SHAPE  OF  THE 
CHARACTER  TO  BE  DISPLAYED 

•  IN TOTAL THE TERMINAL CAN DISPLAY 115 SHAPES

EXAMPLE OF SHAPING PROCEDURE

[Figure: shapes displayed as the keys YAA, TA, KAF, LAM, MIM, SPACE are entered]

ARABIC COMPUTER SYSTEMS
DS990  BILINGUAL  SYSTEM 


DEVICE  SERVICE  ROUTINE  INTERFACE  OVERVIEW 


•     THE DEVICE SERVICE ROUTINE IS CONTROL SOFTWARE BETWEEN THE
USER'S PROGRAM AND THE VIDEO DISPLAY TERMINAL (VDT)


System flexibility by software implementation.

[Figure labels: USER BUFFER]

ARABIC COMPUTER SYSTEMS
DS990 BILINGUAL SYSTEM

BILINGUAL TERMINAL PROGRAM INTERFACE

ARABIC COMPUTER SYSTEMS
DS990 BILINGUAL SYSTEM

BILINGUAL TERMINAL DISPLAY ROM INTERFACE

Problems of Arabic numerics: must use ASCII numeric code for COBOL, FORTRAN.

APPENDIX D
BCON BILINGUAL OPERATING SYSTEM BY ALIS INC.


Default Reduced Codes

Code       Character

00 to 7F   Latin characters, identical to the original ASCII set,
           with the exception of the following two characters:
0E         Function code set Bilingual Screen Operating Mode
           (imaged as Latin space)
0F         Function code set Latin-Only Screen Operating Mode
           (imaged as Latin space)

80         Numeric* space
81         Arabic** number sign
82         Numeric multiply sign
83         Arabic ampersand sign
84         Arabic apostrophe sign
85         Numeric percent sign
86         Numeric divide sign
87         Numeric left parenthesis
88         Numeric right parenthesis
89         Numeric plus sign
8A         Numeric minus sign
8B         Numeric less than sign
8C         Numeric equals sign
8D         Numeric greater than sign
8E         Function code set Arabic Screen Language Mode
           (imaged as Arabic space)
8F         Function code set Latin Screen Language Mode
           (imaged as Arabic space)
90         Function code set Arabic Line Language Mode
           (imaged as Arabic space)
91         Function code set Latin Line Language Mode
           (imaged as Arabic space)
92         Arabic commercial at sign
93         Arabic left square bracket
94         Arabic right square bracket
95         Arabic upward arrow head
96         Arabic underline
97         Arabic reverse apostrophe
98         Arabic left curly bracket
99         Arabic vertical line
9A         Arabic right curly bracket
9B         Arabic tilde
9C         (reserved)
9D         (reserved)
9E         Function code set Line Boundary (imaged as Arabic space)
9F         (reserved)

(*):  Numeric means character is Arabic but has intrinsic right
      spacing (i.e. will be considered part of a numeric string).
(**): Arabic means character has intrinsic left spacing.

A0         Arabic space
A1         Arabic exclamation mark
A2         Arabic quotation mark
A3         Arabic multiply sign
A4         Arabic dollar sign
A5         Arabic percent sign
A6         Arabic period
A7         Arabic divide sign
A8         Arabic left parenthesis
A9         Arabic right parenthesis
AA         Arabic asterisk
AB         Arabic plus sign
AC         Arabic comma
AD         Arabic minus sign
AE         Numeric comma
AF         Arabic solidus
B0         Arabic digit 0
B1         Arabic digit 1
B2         Arabic digit 2
B3         Arabic digit 3
B4         Arabic digit 4
B5         Arabic digit 5
B6         Arabic digit 6
B7         Arabic digit 7
B8         Arabic digit 8
B9         Arabic digit 9
BA         Arabic colon
BB         Arabic semicolon
BC         Arabic greater than sign
BD         Arabic equals sign
BE         Arabic less than sign
BF         Arabic question mark

C0         TAIL
C1         KASHIDA
C2         SHADDAH
C3         SUKUN
C4         FATHA
C5         SHADDAH FAT'HA
C6         FAT'HATAN
C7         SHADDAH FAT'HATAN
C8         DAMMAH
C9         SHADDAH DAMMAH
CA         DAMMATAN
CB         SHADDAH DAMMATAN
CC         KASRAH
CD         SHADDAH KASRAH
CE         KASRATAN
CF         SHADDAH KASRATAN
D0         HAMZAH
D1         ALEF
D2         WASLA ON ALEF
D3         HAMZAH ON ALEF
D4         HAMZAH UNDER ALEF
D5         MADDAH ON ALEF
D6         BA'A
D7         PEH
D8         TA'A MARBUTA
D9         TAA
DA         THA'A
DB         JEEM
DC         SHEEM
DD         HAA
DE         KHA'A
DF         DAL

E0         THAL
E1         RA
E2         ZAIN
E3         SEEM
E4         SEEN
E5         SHEEN
E6         SAD
E7         DAD
E8         TAH
E9         DHAH
EA         AIN
EB         GHAIN
EC         FA
ED         QAF
EE         KAF
EF         GAF
F0         LAM
F1         LAMALEF
F2         WASLA ON LAMALEF
F3         HAMZAH ON LAMALEF
F4         HAMZAH UNDER LAMALEF
F5         MADDAH ON LAMALEF
F6         MEEM
F7         NOON
F8         HA
F9         WAW
FA         HAMZAH ON WAW
FB         ALEF MAQSURA
FC         YA'A
FD         HAMZAH ON YA'A
FE         Arabic reverse solidus
FF         Blank character (imaged as Arabic space)
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Key  Codes  to  Reduced  Codes  Table 


Key  code* 

English 

Reduced  Code 

Arabic 

Arabic 

(ASCII) 

Legend 

(ARCIU 

Legend 

Name 

20 

-i'-\i.  • 

AO 

-['-\v: 

ArjDic  space 

21 

Al 

1 
• 

Arabic  ' 

■)•> 

A2 

It 

Arabic 

23 

ZI 

81 

# 

Arabic  = 

24 

S 

A4 

$ 

Arabic  S 

25 

0^ 

A5 

X 

Arabic  °o 

26 

& 

83 

8.- 

Arabic  & 

27 

■ 

E8 

> 

TAH 

28 

( 

A8 

( 

Arabic  ( 

29 

) 

A9 

) 

Arabic  ) 

2A 

• 

AA 

■ 

Arabic  * 

2B 

- 

AB 

♦ 

Arabic  + 

2C 

F9 

J 

WAW 

2D 

- 

AD 

- 

Arabic  - 

2E 

E2 

J 

ZAIN 

2F 

/ 

E9 

> 

DHAH 

30 

II 

BO 

• 

Arabic  0 

31 

1 

Bl 

1 

Arabic  1 

32 

■^ 

B2 

T 

Arabic  2 

33 

3 

B3 

T 

Arabic  3 

34 

4 

B4 

i 

Arabic  4 

35 

5 

B5 

0 

Arabic  5 

36 

6 

B6 

1 

Arabic  6 

37 

7 

B7 

V 

Arabic  7 

38 

s 

B8 

A 

Arabic  8 

39 

9 

B9 

^ 

Arjbic  9 

3A 

BA 

: 

Arabic  : 

3B 

EE 

d 

KAF 

3C 

AE 

• 

Arabic  numenc 

comma 

3D 

= 

BD 

3 

Arabic  = 

3E 

A6 

. 

Arabic 

3F 

■) 

BF 

c 

Arabic  ■> 

D:  Character  byte  of  key  code  word  onlv  (low-order  bvte).  The 
scan  code  (high-order  byte)  is  not  modihed  by  BCON. 
Key code      English legend  Reduced code  Arabic legend  Arabic name

40            @               92            @              Arabic @
41            A               CC            ـِ              KASRAH
42            B               F5            لآ             MADDAH ON LAMALEF
43            C               98            {              Arabic {
44            D               93            [              Arabic [
45            E               C8            ـُ              DAMMAH
46            F               94            ]              Arabic ]
47            G               F3            لأ             HAMZAH ON LAMALEF
48            H               D3            أ              HAMZAH ON ALEF
49            I               A7            ÷              Arabic divide sign
4A            J               C1            ـ              KASHIDA
4B            K               AC            ،              Arabic comma
4C            L               AF            /              Arabic /
4D            M               84            '              Arabic '
4E            N               D5            آ              MADDAH ON ALEF
4F            O               A3            ×              Arabic multiply sign
50            P               BB            ؛              Arabic semi-colon
51            Q               C4            ـَ              FATHA
52            R               CA            ـٌ              DAMMATAN
53            S               CE            ـٍ              KASRATAN
54            T               F4            لإ             HAMZAH UNDER LAMALEF
55            U               97            `              Arabic `
56            V               9A            }              Arabic }
57            W               C6            ـً              FATHATAN
58            X               C3            ـْ              SUKUN
59            Y               D4            إ              HAMZAH UNDER ALEF
5A            Z               C0                           TAIL
5B            [               DB            ج              JEEM
5C            \               FE            \              Arabic \
5D            ]               DF            د              DAL
5E            ^               95            ^              Arabic ^
5F            _               96            _              Arabic _

Key code      English legend  Reduced code  Arabic legend  Arabic name

60            `               E0            ذ              THAL
61            a               E5            ش              SHEEN
62            b               F1            لا             LAMALEF
63            c               FA            ؤ              HAMZAH ON WAW
64            d               FC            ي              YA'A
65            e               DA            ث              THA'A
66            f               D6            ب              BA'A
67            g               F0            ل              LAM
68            h               D1            ا              ALEF
69            i               F8            ه              HA
6A            j               D9            ت              TA'A
6B            k               F7            ن              NOON
6C            l               F6            م              MEEM
6D            m               D8            ة              TA'A MARBUTA
6E            n               FB            ى              ALEF MAQSURA
6F            o               DE            خ              KHA'A
70            p               DD            ح              HA'A
71            q               E7            ض              DAD
72            r               ED            ق              QAF
73            s               E4            س              SEEN
74            t               EC            ف              FA
75            u               EA            ع              AIN
76            v               E1            ر              RA
77            w               E6            ص              SAD
78            x               D0            ء              HAMZAH
79            y               EB            غ              GHAIN
7A            z               FD            ئ              HAMZAH ON YA'A
7B            {               BE            >              Arabic >
7C            |               99            |              Arabic |
7D            }               BC            <              Arabic <
7E            ~               C2            ـّ              SHADDAH

Key code       English legend  Reduced code  Arabic legend  Arabic name
(scan - char)                  (ARCII)

1800           Alt O           D7            پ              PEH
1800           Alt O           D7            پ              PEH
1900           Alt P           DC            چ              SHEEM
1900           Alt P           DC            چ              SHEEM
2500           Alt K           E3            ژ              SEEM
2500           Alt K           E3            ژ              SEEM
2600           Alt L           EF            گ              GAF
2600           Alt L           EF            گ              GAF
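
The key-code tables above can be read as a lookup on the 16-bit key code word. The following is a minimal sketch only, not BCON's actual implementation: the names `CHAR_TO_REDUCED`, `ALT_TO_REDUCED`, `split_key_word` and `translate_key_word` are hypothetical, the dictionaries hold just a few rows of the tables, and keeping the scan byte for the Alt combinations is an assumption (the tables only give their reduced codes).

```python
# Hypothetical excerpt of the main key-code table: character byte -> reduced code.
CHAR_TO_REDUCED = {
    0x27: 0xE8,  # '  -> TAH
    0x61: 0xE5,  # a  -> SHEEN
    0x68: 0xD1,  # h  -> ALEF
    0x6C: 0xF6,  # l  -> MEEM
}

# Alt combinations carry a zero character byte, so they are keyed on the whole word.
ALT_TO_REDUCED = {
    0x1800: 0xD7,  # PEH
    0x1900: 0xDC,  # SHEEM
    0x2500: 0xE3,  # SEEM
    0x2600: 0xEF,  # GAF
}


def split_key_word(word):
    """Return (scan_code, char_byte) of a 16-bit key code word."""
    return (word >> 8) & 0xFF, word & 0xFF


def translate_key_word(word):
    """Replace the low-order byte by its reduced code; the scan code is kept."""
    scan, char = split_key_word(word)
    if word in ALT_TO_REDUCED:
        # Assumption: the scan byte is preserved for Alt combinations as well.
        return (scan << 8) | ALT_TO_REDUCED[word]
    # Characters without a table entry pass through unchanged.
    return (scan << 8) | CHAR_TO_REDUCED.get(char, char)


if __name__ == "__main__":
    # 0x1E61 is the word a PC keyboard returns for the "a" key (scan 1E, char 61).
    print(hex(translate_key_word(0x1E61)))
```

Keying the Alt rows on the whole word reflects that their character byte is 00 in the table, so the low-order byte alone cannot distinguish them.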
Display Codes

Display code  Reduced code  Name                                      (shape) (*)

Display codes below 100 are Latin characters, identical to the original ASCII set,
with the exception of the following two characters:

0E                          Function code 0E
0F                          Function code 0F

100           C2            SHADDAH
101           C3            SUKUN
102           C4            FATHA
103           C5            SHADDAH FATHA
104           C6            FATHATAN
105           C7            SHADDAH FATHATAN
106           C8            DAMMAH
107           C9            SHADDAH DAMMAH
108           CA            DAMMATAN
109           CB            SHADDAH DAMMATAN
10A           CC            KASRAH
10B           CD            SHADDAH KASRAH
10C           CE            KASRATAN
10D           CF            SHADDAH KASRATAN
10E           A0            Arabic visible space
10F           9E            Arabic visible boundary
110           8E            Function code 8E
111           8F            Function code 8F
112           90            Function code 90
113           91            Function code 91
114           9E            Function code 9E
115           00            (Reserved)
116           00            (Reserved)
117           00            (Reserved)
118           00            (Reserved)
119           00            (Reserved)
11A           00            (Reserved)
11B           00            (Reserved)
11C           00            (Reserved)
11D           00            (Reserved)
11E           00            (Reserved)
11F           00            (Reserved)

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.
Display code  Reduced code  Name                                      (shape) (*)

120           A0            Arabic space
121           A1            Arabic !
122           A2            Arabic "
123           81            Arabic #
124           A4            Arabic $
125           A5            Arabic %
126           83            Arabic &
127           84            Arabic '
128           A8            Arabic (
129           A9            Arabic )
12A           AA            Arabic *
12B           AB            Arabic +
12C           AE            Arabic , (numeric comma)
12D           AD            Arabic -
12E           A6            Arabic .
12F           AF            Arabic /
130           B0            Arabic 0
131           B1            Arabic 1
132           B2            Arabic 2
133           B3            Arabic 3
134           B4            Arabic 4
135           B5            Arabic 5
136           B6            Arabic 6
137           B7            Arabic 7
138           B8            Arabic 8
139           B9            Arabic 9
13A           BA            Arabic :
13B           BB            Arabic ;
13C           BC            Arabic <
13D           BD            Arabic =
13E           BE            Arabic >
13F           BF            Arabic ?

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.
Display code  Reduced code  Name                                      (shape) (*)

140           92            Arabic @
141           D0            HAMZAH
142           D1            ALEF                                      AI
143           D2            WASLA ON ALEF                             AI
144           D4            HAMZAH UNDER ALEF                         AI
145           D7            PEH                                       A
146           D7            PEH                                       I
147           D8            TA'A MARBUTA                              AI
148           D9            TA'A                                      A
149           D9            TA'A                                      I
14A           DA            THA'A                                     A
14B           DA            THA'A                                     I
14C           DB            JEEM                                      A
14D           DB            JEEM                                      I
14E           DC            SHEEM                                     A
14F           DC            SHEEM                                     I
150           DD            HA'A                                      A
151           DD            HA'A                                      I
152           DE            KHA'A                                     A
153           DE            KHA'A                                     I
154           DF            DAL                                       AI
155           F1            LAMALEF                                   A
156           F2            WASLA ON LAMALEF                          A
157           F3            HAMZAH ON LAMALEF                         A
158           F4            HAMZAH UNDER LAMALEF                      A
159           F5            MADDAH ON LAMALEF                         A
15A           F6            MEEM                                      A
15B           93            Arabic [
15C           FE            Arabic \
15D           94            Arabic ]
15E           95            Arabic ^
15F           96            Arabic _

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.


Display code  Reduced code  Name                                      (shape) (*)

160           97            Arabic `
161           F6            MEEM                                      I
162           F7            NOON                                      A
163           F7            NOON                                      I
164           F8            HA                                        A
165           AC            Arabic text comma
166           A3            Arabic x (multiply sign)
167           A7            Arabic divide sign
168           D3            HAMZAH ON ALEF                            AI
169           E0            THAL                                      AI
16A           00            Arabic >>
16B           00            Arabic <<
16C           E4            SEEN with compressed tail                 A
16D           E5            SHEEN with compressed tail                A
16E           E6            SAD with compressed tail                  A
16F           E7            DAD with compressed tail                  A
170           80            Numeric space
171           82            Numeric x (multiply sign)
172           85            Numeric %
173           86            Numeric divide sign
174           87            Numeric (
175           88            Numeric )
176           89            Numeric +
177           8A            Numeric -
178           8B            Numeric <
179           8C            Numeric =
17A           8D            Numeric >
17B           98            Arabic {
17C           99            Arabic |
17D           9A            Arabic }
17E           9B            Arabic ~
17F           FF            Arabic (DELETE sign)

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.
Display code  Reduced code  Name                                      (shape) (*)

180           C2            SHADDAH                                   (linking)
181           C3            SUKUN                                     (linking)
182           C4            FATHA                                     (linking)
183           C5            SHADDAH FATHA                             (linking)
184           C6            FATHATAN                                  (linking)
185           C7            SHADDAH FATHATAN                          (linking)
186           C8            DAMMAH                                    (linking)
187           C9            SHADDAH DAMMAH                            (linking)
188           CA            DAMMATAN                                  (linking)
189           CB            SHADDAH DAMMATAN                          (linking)
18A           CC            KASRAH                                    (linking)
18B           CD            SHADDAH KASRAH                            (linking)
18C           CE            KASRATAN                                  (linking)
18D           CF            SHADDAH KASRATAN                          (linking)
18E           C0            TAIL
18F           C1            KASHIDA
190           D1            ALEF                                      F
191           D2            WASLA ON ALEF                             MF
192           E4            SEEN with compressed tail                 F
193           D3            HAMZAH ON ALEF                            MF
194           D4            HAMZAH UNDER ALEF                         MF
195           D5            MADDAH ON ALEF                            AI
196           D5            MADDAH ON ALEF                            MF
197           D6            BA'A                                      A
198           D6            BA'A                                      F
199           D6            BA'A                                      I
19A           D6            BA'A                                      M
19B           D7            PEH                                       F
19C           D7            PEH                                       M
19D           D8            TA'A MARBUTA                              MF
19E           D9            TA'A                                      F
19F           D9            TA'A                                      M

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.
Display code  Reduced code  Name                                      (shape) (*)

1A0           DA            THA'A                                     F
1A1           DA            THA'A                                     M
1A2           DB            JEEM                                      F
1A3           DB            JEEM                                      M
1A4           DC            SHEEM                                     F
1A5           DC            SHEEM                                     M
1A6           DD            HA'A                                      F
1A7           DD            HA'A                                      M
1A8           DE            KHA'A                                     F
1A9           DE            KHA'A                                     M
1AA           DF            DAL                                       MF
1AB           E5            SHEEN with compressed tail                F
1AC           E0            THAL                                      MF
1AD           E1            RA                                        AI
1AE           E1            RA                                        MF
1AF           E2            ZAIN                                      AI
1B0           E2            ZAIN                                      FM
1B1           E3            SEEM                                      AI
1B2           E3            SEEM                                      FM
1B3           E4            SEEN                                      A
1B4           E4            SEEN                                      F
1B5           E4            SEEN                                      I
1B6           E4            SEEN                                      M
1B7           E5            SHEEN                                     A
1B8           E5            SHEEN                                     F
1B9           E5            SHEEN                                     I
1BA           E5            SHEEN                                     M
1BB           E6            SAD                                       A
1BC           E6            SAD                                       F
1BD           E7            DAD                                       A
1BE           E7            DAD                                       F
1BF           E8            TAH                                       AI

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.
Display code  Reduced code  Name                                      (shape) (*)

1C0           E8            TAH                                       MF
1C1           E9            DHAH                                      AI
1C2           E9            DHAH                                      MF
1C3           EA            AIN                                       A
1C4           EA            AIN                                       F
1C5           EA            AIN                                       I
1C6           EA            AIN                                       M
1C7           EB            GHAIN                                     A
1C8           EB            GHAIN                                     F
1C9           EB            GHAIN                                     I
1CA           EB            GHAIN                                     M
1CB           EC            FA                                        A
1CC           EC            FA                                        F
1CD           EC            FA                                        I
1CE           EC            FA                                        M
1CF           ED            QAF                                       A
1D0           ED            QAF                                       F
1D1           ED            QAF                                       I
1D2           ED            QAF                                       M
1D3           EE            CAF                                       A
1D4           EE            CAF                                       F
1D5           EE            CAF                                       AI
1D6           EE            CAF                                       MF
1D7           EF            GAF                                       A
1D8           EF            GAF                                       F
1D9           EF            GAF                                       I
1DA           EF            GAF                                       M
1DB           F0            LAM                                       A
1DC           F0            LAM                                       F
1DD           F0            LAM                                       I
1DE           F0            LAM                                       M
1DF           F1            LAMALEF                                   F

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.
Display code  Reduced code  Name                                      (shape) (*)

1E0           F2            WASLA ON LAMALEF                          F
1E1           F3            HAMZAH ON LAMALEF                         F
1E2           F4            HAMZAH UNDER LAMALEF                      F
1E3           F5            MADDAH ON LAMALEF                         F
1E4           F6            MEEM                                      F
1E5           F6            MEEM                                      M
1E6           F7            NOON                                      F
1E7           F7            NOON                                      M
1E8           F8            HA                                        F
1E9           F8            HA                                        I
1EA           F8            HA                                        M
1EB           F9            WAW                                       A
1EC           F9            WAW                                       F
1ED           FA            HAMZAH ON WAW                             A
1EE           FA            HAMZAH ON WAW                             F
1EF           FB            ALEF MAQSURA                              AI
1F0           FB            ALEF MAQSURA                              MF
1F1           FC            YA'A                                      A
1F2           FC            YA'A                                      F
1F3           FC            YA'A                                      I
1F4           FC            YA'A                                      M
1F5           FD            HAMZAH ON YA'A                            A
1F6           FD            HAMZAH ON YA'A                            F
1F7           FD            HAMZAH ON YA'A                            I
1F8           FD            HAMZAH ON YA'A                            M
1F9           00            ALEF (for LAMALEF)                        MF
1FA           00            WASLA ON ALEF (for WASLA ON LAMALEF)      M
1FB           00            HAMZAH ON ALEF (for HAMZAH ON LAMALEF)    MF
1FC           00            HAMZAH UNDER ALEF
                            (for HAMZAH UNDER LAMALEF)                MF
1FD           00            MADDAH ON ALEF (for MADDAH ON LAMALEF)    MF
1FE           E6            SAD with compressed tail                  F
1FF           E7            DAD with compressed tail                  F

(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.
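
The two footnotes suggest how the display-code tables could be used in both directions: from a display code back to its reduced code, and from a reduced code plus a position in the word to the matching shape. The sketch below assumes this reading; `DISPLAY_TABLE`, `reduced_code` and `display_code` are hypothetical names, and the dictionary contains only a handful of rows (BA'A, RA and one reserved entry) taken from the tables above.

```python
RESERVED = 0x00  # footnote (**): no reduced code is associated by default

# Excerpt of the display-code tables: display code -> (reduced code, shape letters).
DISPLAY_TABLE = {
    0x197: (0xD6, "A"),    # BA'A, alone
    0x198: (0xD6, "F"),    # BA'A, final
    0x199: (0xD6, "I"),    # BA'A, initial
    0x19A: (0xD6, "M"),    # BA'A, medial
    0x1AD: (0xE1, "AI"),   # RA, alone or initial
    0x1AE: (0xE1, "MF"),   # RA, medial or final
    0x1F9: (RESERVED, "MF"),  # ALEF (for LAMALEF), reserved
}


def reduced_code(display):
    """Reduced code for a display code, or None when the entry is reserved."""
    reduced, _shapes = DISPLAY_TABLE[display]
    return None if reduced == RESERVED else reduced


def display_code(reduced, position):
    """Display code whose shape covers `position` ('A', 'F', 'I' or 'M')."""
    if reduced == RESERVED:
        return None
    for display, (code, shapes) in sorted(DISPLAY_TABLE.items()):
        # A shape such as "MF" serves both the medial and the final position.
        if code == reduced and position in shapes:
            return display
    return None


if __name__ == "__main__":
    print(hex(display_code(0xD6, "M")))
    print(hex(display_code(0xE1, "F")))
```

Letters that do not join on both sides, such as RA, need only two glyphs, which is why their rows carry the combined shapes AI and MF.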
APPENDIX  E
CODAR I, II, U CODE SETS


Seven bit CODAR II

[Code chart: 7-bit CODAR II, with columns 0 to 7 selected by bits b7 b6 b5 and rows 0 to 15
selected by bits b4 b3 b2 b1. Columns 0 and 1 hold the control characters (NUL to SI and DLE
to US), column 2 the space (ESP) and punctuation, column 3 the digits and further punctuation.
The Arabic characters in the remaining columns are illegible.]

CODAR II coding compatible with CCITT Nr. 5. The set coded is the sub-system ASV-CODAR/1
comprising 64 characters for informatics and data transmission. It was presented at the
UNESCO/IBI Conference at Bizerte, 1976. The ASV-CODAR/2 sub-system can be obtained by
eliminating the characters framed in heavy lines.
Seven bit CODAR U

[Code chart: 7-bit CODAR-U, laid out by bit pattern in the same way as the CODAR II chart.
Columns 0 and 1 hold the control characters (NUL to SI and DLE to US), columns 2 and 3 the
space (ESP), punctuation and digits. The Arabic characters in the remaining columns are
illegible.]
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APPENDIX  F 
FINAL  CODE  U-F.D. 


FINAL  CODE 
CODAR  U-F.D. 

Recommendation  of  the  final 

Meeting  Held  In  Rabat  (Morocco) 

On  22-24  April  1982 
FOREWORD

The importance of the role of the information channels in the Arabic world is becoming increasingly
obvious in all sectors. All Arabic countries are dealing with various types of information in the fields of
administration, organization, planning, science and technology.

The simple concept of cooperation between the Arab countries, and the positive results of
standardization, make it necessary to introduce a unified cipher for the Arabic characters used in the field
of information exchange.

In this connection the concerned Arabic organizations have taken considerable measures, such as the
two meetings which were held in Rabat (Morocco). The first meeting was the (Arab experts conference for
the unified Arabic cipher in the field of information). It was held with the cooperation of the (Arabic
Institute for Researches and Arabization) during the period 25th-29th Sept., 1980. The second
meeting was concerned with the regulation of the Arabic cipher in its final shape and was held on April 22-24,
1982. In this meeting the technical committee achieved the projected corrections, and the Arabic
cipher which is known as (CODAR U.F.D.) was ready.

Attached are the reasons for modification of the CODAR-U.F.D., the recommendations adopted at
the meetings and the final shape of the unified Arabic cipher, which will be formed into an Arabic standard.
This standard will be distributed to the ASMO member bodies for further study and approval as a
prelude to the actual experimentation and application.


RECOMMENDATION

In the final session, and with the agreement of the conferees on the final shape of the unified
Arabic cipher, the following recommendations were adopted:

(1)  The conference requests the Arab League Education, Culture and Science Organization
(ALECSO) and the Arab Organization for Standards and Metrology (ASMO) to adopt the
Arabic cipher which has been agreed upon, and to take all necessary measures for its adoption
and enforcement in all Arabic countries.

(2)  The conferees recommend that the information organizations that use the Arabic language
experiment with the new cipher before enforcement.

These recommendations shall be submitted in particular to the (Institute for Research and
Studies for Arabization) in Morocco, the Saudi Arabian Standards Organization and the
National Center for Information in Tunisia for the purpose of testing the new cipher before
the next (ASMO) meeting.

(3)  It is recommended that the Arabic cipher in its new and final shape be adopted by the Arabic
association for telecommunications.

(4)  It is also recommended that ALECSO, ASMO and the Arab association for
telecommunication make the necessary coordination to use the Arabic language in the field of
information between them and other international organizations and bodies and UNESCO.

(5)  The meeting recommends an emergency session of ALECSO and ASMO to regulate the
specifications of the devices, the printing letters and their forms, and to find the best way of
utilizing computers.

(6)  The meeting also recommends continuous contact between ALECSO and ASMO to see to
the best execution of these recommendations.
CODAR - U/FD

Unified Arabic coding, final form
(Rabat, 22-24 April 1982)


ALECSO - ASMO meeting

on the finalization and standardization
of CODAR-U.
APPENDIX  G
ASMO'S  APPROVED  ARAB  STANDARD  SPECIFICATIONS



ARAB  STANDARD  SPECIFICATIONS 

449 

Data  processing  -  7  -  bit  coded  Arabic  Character  set  for  Information  Interchange 


ARAB  LEAGUE 

ARAB  ORGANIZATION  FOR  STANDARDIZATION 
AND  METROLOGY  (  ASMO  ) 
Preface


This  Arabic  Standard  was  prepared  by  technical  committee  No.  8  (Arabic  characters  in  informatics). 
Among  the  parties  who  participated  in  its  preparation  are  the  Arab  League  Educational,  Cultural, 
and  Scientific  Organization  (ALECSO),  and  the  Institute  of  Studies  and  Research  for  Arabization  in 
Morocco. 

In accordance with the 1982 Directives for the Technical Work of the Arab Organization for
Standardization and Metrology (Part I: Procedure and Working Methods), this Arabic Standard was
adopted by the resolution of the General Assembly of ASMO No.
(R 342 / G.A. / S 15 - October 21, 1982).
DATA  PROCESSING:  7-BIT  CODED  ARABIC

CHARACTER  SET  FOR  INFORMATION

INTERCHANGE

0.  INTRODUCTION

This Arabic Standard specifies the properties of a coded character set using 7-bit binary codes for
information interchange among different types of data processing equipment using the Arabic
characters. It also specifies a set of control and graphic characters, together with their coded
representation, which is based on ISO 646. The set of specific graphic characters in this standard
makes it possible under all circumstances to represent Arabic text, whether it is totally vowelized,
partially vowelized, or unvowelized. This standard provides for information interchange in
special applications, and for extension in case the coded character set is insufficient.
This Arabic Standard was made in accordance with ISO 646, and the following points were
modified so that ISO 646 is convenient for Arabic usage:

—  Table  I. 

—  Comments  on  this  table. 

Table I was modified in a way that permits the coded character set to be used as a
separate group from the Latin character set described in ISO 646 for information interchange,
and permits basic programs to be written in the Arabic language for the purpose of complete
Arabization when using computers. This table also allows the coded character set to be used
together with the Latin character set of the International Standard ISO 646, because of the
correspondence between these two standards.

Applying this standard requires several application standards to be implemented on a carrier
(magnetic carrier, transmission network, etc.), and these applications are specified in other
standards.

1.  SCOPE  AND  FIELD  OF  APPLICATION 

1.1  This Arabic Standard contains a set of 128 characters (control characters and graphic
characters such as letters, digits and symbols) with their coded representation. Most of
these characters are mandatory and unchangeable, but provision is made for some
flexibility to accommodate special national and other requirements.

1.2  The  need  for  graphics  and  controls  in  data  processing  and  in  data  transmission  has  been  taken 
into  account  in  determining  this  character  set. 

1.3  This  Arabic  Standard  consists  of  a  general  table  with  a  number  of  options,  notes,  a  legend  and 
explanatory  notes. 

1.4  This  character  set  is  primarily  intended  for  the  interchange  of  information  among  data 
processing  systems  and  associated  equipment,  and  within  message  transmission  systems. 
1.5  This character set is applicable to all Arabic alphabets.

1.6  This character set includes facilities for extension where its 128 characters are insufficient for
particular applications.

1.7  The definitions of some control characters in this Arabic Standard assume that the data associated
with them is to be processed serially in a forward direction. When such characters are included in
strings of data which are processed other than serially in a forward direction, or in data
formatted for fixed record processing, they may have undesirable effects or may require additional
special treatment to ensure that they have their desired effect.

2.  IMPLEMENTATION 

2.1  This character set should be regarded as a basic alphabet in the abstract sense. Its practical use
requires definitions of its implementation in various media. For example, this could include
punched tapes, punched cards, magnetic tapes and transmission channels, thus permitting
interchange of data to take place either indirectly by means of an intermediate recording in a
physical medium, or by local electrical connection of various units (such as input and output
devices and computers), or by means of data transmission equipment.

2.2  The  implementation  of  this  coded  character  set  in  physical  media  and  for  transmission,  taking 
into  account  the  need  for  error  checking,  is  the  subject  of  other  ISO  publications. 
Table  (I)

[Code table: the 7-bit coded Arabic character set, laid out as in ISO 646 with columns 0 to 7
selected by bits b7 b6 b5 and rows 0 to 15 selected by bits b4 b3 b2 b1. Columns 0 and 1 hold
the control characters (NUL, the transmission controls TC, BEL, the format effectors FE, SO, SI,
the device controls DC, CAN, EM, SUB, ESC and the information separators IS). Column 2 begins
with SP, column 3 holds the digits, and position 7/15 is DEL. The Arabic graphic characters in
columns 4 to 7 are illegible.]

(1)  See Note 1
(2)  See Note 2
(3)  See Note 3
(4)  See Note 4
NOTES ABOUT TABLE 1:

1) The format effectors are intended for equipment in which horizontal and vertical movements are
effected separately. If equipment requires the action of CARRIAGE RETURN to be combined
with a vertical movement, the format effector for that vertical movement may be used to effect the
combined movement. For example, if NEW LINE (symbol NL, equivalent to CR+LF) is
required, FE2 shall be used to represent it. This substitution requires agreement between the
sender and the recipient of the data.

The use of these combined functions may be restricted for international transmission on general
switched telecommunication networks (telegraph and telephone networks).

2) The symbols at positions 2/3 and 2/4 denote NUMBER SIGN and CURRENCY SIGN respectively.
Note that the currency sign does not designate the currency of a specific country
unless otherwise agreed upon between the sender and the recipient of the data.

3) These positions are intended for national use or for alphabet extension. If not used for such
purposes, they may be used for representing symbols which do not have specific functions. This
requires agreement between the sender and the recipient of the data.

For the general case of information interchange among computers, these positions shall not be
used.

4) The positions and names of special signs which have specific functions in the code table are the same as in
ISO 646. However, such signs should be imaged and printed according to text as shown in the
following Table.
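
Positions such as 2/3 and 2/4 in Note 2 are written as column/row. The conversion to a numeric code value can be sketched as below, assuming the usual layout of a 7-bit table with 16 rows per column, so that the value is 16 times the column plus the row. The function name CodeValue is hypothetical and is not taken from the standard.

```pascal
program CodePosition;

{ Hypothetical helper: converts a column/row table position
  to its 7-bit code value, assuming 16 rows per column. }
function CodeValue(Column, Row: integer): integer;
begin
  CodeValue := Column * 16 + Row;
end;

begin
  writeln('2/3  -> ', CodeValue(2, 3));    { NUMBER SIGN position }
  writeln('2/4  -> ', CodeValue(2, 4));    { CURRENCY SIGN position }
  writeln('7/15 -> ', CodeValue(7, 15));   { DEL, the last position }
end.
```

Under this layout position 2/3 is code 35 (23 hex), 2/4 is code 36 (24 hex) and 7/15 is code 127 (7F hex).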
APPENDIX H
PROGRAM CODE


Program Lexical_Translator(input,output);

(* ************************************************************
   File Name     = Lexical.pas
   Module name   = Lexical_Translator
   Author        = Sadek Saleh AL-Juhaiman
   Date created  = April 4, 1986
   Last change   = Aug 4, 1986

   Calls
     Open_File         = Gets the source file name and
                         initialize the Output files.
     Initialize        = To initialize the hash table and global
                         variables.
     Fill_Buffer       = Fill the line buffer and increment the
                         line no.
     Buffer_Empty      = Check if the line buffer was consumed.
     Token_And_Type    = Get the next token and its type.
     Map_Iden_To_Latin = Search for the identifier in the
                         symbol table. If not predefined
                         then insert it.
     Latin_Integer     = Map integer tokens to Latin integers.
     Special_Character = Map special characters to Latin
                         equivalent character.
     Control_Char      = Notifies the presence of escape
                         codes.

   Called by     = None
   Include files = Resource.pas

   Variables
     Line        = Input line buffer.
     Next_Loc    = Points at the first char of next token.
     Token       = Buffer of 255 character.
     Tok_Type    = Types of the token present in token buffer.
     Tok_Len     = The length of the token in token buffer.
     Line_No     = Source code line number.
     Debug_On    = Boolean variable, debugging feature, set
                   by Arabic directive in the source code.
     Comment_On  = Directive, to include the comments in
                   the generated output.
     Res_Word    = Array of records for the reserved words.
                   Contains the Arabic and its English match.
     Match_Ind   = Index in Res_Word array to token location.
     Int_Str     = Integer string of size 10 characters.
     Line        = Input line buffer.
     Next_Loc    = The first character of the next token in
                   the line buffer.
     Token       = Token buffer.
     Latin_Id    = The mapped identifier (in Latin form).
     Hash        = HashTable.
     ArabicSpell = Spelling string array of 5000 chars.
     Characters  = Number of chars in spelling table.
     Line_No     = Counts the read source lines.
     Line_Size   = Line buffer upper limit.
     Match_Ind   = Index of reserved word found in the
                   constant array.
     Iden_No     = The number of the identifier in the
                   sequence of arrival.
     Latin_Char  = One character buffer for special
                   characters.
     Lat_Int     = The integer translation to Latin.
     Error_Set   = Token error set.

   Comment :
     The program will ask for input source file with or
     without extension. If the name is valid it will open
     the file and initialize two output files. The two
     files will have the same file name and the extensions DIC
     and PAS. The DIC file has all user-defined identifiers
     with their assigned Id_Numbers. The PAS file will have
     the generated PASCAL code.

     After initialization the program will take one
     line and break it to tokens. The token is given a
     type, then based on the type a translation module will
     be called.

     The above will continue for each line of code until
     a major error is encountered. Major error will result
     from long tokens when using comments or literal string.
   ************************************************************ *)
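
The naming rule stated in the header comment (one input name, two output files with the extensions PAS and DIC) can be sketched as follows. This is an illustration only, not the thesis code: BaseName is a hypothetical helper and SAMPLE.ARB is an invented file name.

```pascal
program OutputNames;

type
  NameStr = string[12];

{ Hypothetical helper: returns the part of FileName that precedes
  the first '.' or ' ' (the whole name if neither occurs). }
function BaseName(FileName: NameStr): NameStr;
var
  i: integer;
  Stop: boolean;
  Base: NameStr;
begin
  Base := '';
  i := 1;
  Stop := false;
  while (i <= length(FileName)) and not Stop do
  begin
    if (FileName[i] = '.') or (FileName[i] = ' ') then
      Stop := true
    else
    begin
      Base := Base + FileName[i];
      i := i + 1;
    end;
  end;
  BaseName := Base;
end;

begin
  writeln(BaseName('SAMPLE.ARB') + '.PAS');   { generated PASCAL code }
  writeln(BaseName('SAMPLE.ARB') + '.DIC');   { identifier dictionary }
  writeln(BaseName('SAMPLE') + '.PAS');       { name typed without extension }
end.
```

The loop uses an explicit Stop flag so that the string is never indexed past its length, whether or not the compiler short-circuits boolean expressions.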




CONST
  Max_Arb_Word = 12;       { size of Arabic word }
  Max_Lat_Word = 12;       { size of Latin word }
  Max_Len      = 255;      { line & literal size }
  Res_Words    = 59;       { reserved words size }
  MaxKey       = 6310;     { Prime number, hashing }
  MaxChar      = 5000;     { Size of spelling table }

TYPE
  Line_Range     = 0..Max_Len;
  Arab_Word_Str  = string[Max_Arb_Word];
                                 { max char per Latin word }
  Latn_Word_Str  = string[Max_Lat_Word];

  Word_Rec       = RECORD        { constant array record }
                                 { of reserved words }
                     English : Latn_Word_Str;
                     Arabic  : Arab_Word_Str;
                   END;

  Reserved_Index = 1..Res_Words;
  Words          = array [Reserved_Index] OF Word_Rec;
  Latin_Token    = string[6];    { string in the form id_000 }
  WordPointer    = ^WordRecord;  { Pointer to user defined id }
  WordRecord     = RECORD        { for user defined iden. }
                     Index,      { identifier number sequence }
                     Lenth,      { Length of the word. }
                                 { Location of the word last }
                                 { character in symbol table. }
                     LastChar : integer;
                                 { link pointer to next word }
                     NextWord : WordPointer;
                                 { assigned identifier number }
                     Latin_Id : Latin_Token;
                   END;

  HashTable      = array [1..MaxKey] OF WordPointer;
  SpellingTable  = array [1..MaxChar] OF char;
  Ln_Str         = string[Max_Len];
  Token_Str      = string[Max_Len];
  Errors         = (Long_Token, Long_Comment,
                    Long_Literal_Str, Illegal_Char);
  Types_Of_Token = (Blanks, Illegal, Reserved_Word,
                    Literal_Str, Contrl_Cod, Unclsfd,
                    Identifier, Coment, Integer1,
                    Funct_Operator);

                                 { Arabic characters range }
                                 { from 80 Hex to FF Hex }
  Arbic_Alph     = set of $80..$FF;
  Str10          = string[10];


CONST
  { resource file contains the reserved words }
  { The Arabic strings did not survive scanning; each is shown as '?'. }
  res_word : words =
  (
    (english: 'absolute';  arabic: '?'),
    (english: 'and';       arabic: '?'),
    (english: 'array';     arabic: '?'),
    (english: 'begin';     arabic: '?'),
    (english: 'case';      arabic: '?'),
    (english: 'const';     arabic: '?'),
    (english: 'div';       arabic: '?'),
    (english: 'do';        arabic: '?'),
    (english: 'downto';    arabic: '?'),
    (english: 'else';      arabic: '?'),
    (english: 'end';       arabic: '?'),
    (english: 'external';  arabic: '?'),
    (english: 'text';      arabic: '?'),
    (english: 'forward';   arabic: '?'),
    (english: 'for';       arabic: '?'),
    (english: 'function';  arabic: '?'),
    (english: 'goto';      arabic: '?'),
    (english: 'concat';    arabic: '?'),
    (english: 'inline';    arabic: '?'),
    (english: 'if';        arabic: '?'),
    (english: 'in';        arabic: '?'),
    (english: 'label';     arabic: '?'),
    (english: 'mod';       arabic: '?'),
    (english: 'nil';       arabic: '?'),
    (english: 'not';       arabic: '?'),
    (english: 'overlay';   arabic: '?'),
    (english: 'of';        arabic: '?'),
    (english: 'or';        arabic: '?'),
    (english: 'packed';    arabic: '?'),
    (english: 'procedure'; arabic: '?'),
    (english: 'program';   arabic: '?'),
    (english: 'record';    arabic: '?'),
    (english: 'repeat';    arabic: '?'),
    (english: 'set';       arabic: '?'),
    (english: 'begin';     arabic: '?'),
    (english: 'shl';       arabic: '?'),
    (english: 'real';      arabic: '?'),
    (english: 'integer';   arabic: '?'),
    (english: 'boolean';   arabic: '?'),
    (english: 'read';      arabic: '?'),
    (english: 'readln';    arabic: '?'),
    (english: 'write';     arabic: '?'),
    (english: 'writeln';   arabic: '?'),
    (english: 'end';       arabic: '?'),
    (english: 'shr';       arabic: '?'),
    (english: 'string';    arabic: '?'),
    (english: 'then';      arabic: '?'),
    (english: 'type';      arabic: '?'),
    (english: 'to';        arabic: '?'),
    (english: 'until';     arabic: '?'),
    (english: 'var';       arabic: '?'),
    (english: 'str';       arabic: '?'),
    (english: 'chr';       arabic: '?'),
    (english: 'ord';       arabic: '?'),
    (english: 'while';     arabic: '?'),
    (english: 'input';     arabic: '?'),
    (english: 'output';    arabic: '?'),
    (english: 'with';      arabic: '?'),
    (english: 'xor';       arabic: '?')
  );

  Arabic_Alph : Arbic_Alph =
    [ $B0 .. $B9,       { Arabic digit }
      $D0 .. $FD,       { Arabic letters }
      $96,              { under score }
      $C0 ];            { tail generation }

  Delimiters : SET OF char =      { const set delimiters }
    [ #$80,             { Space }
      #$8E,             { BCON function code }
      #$8F,             { BCON function code }
      #$90,             { BCON function code }
      #$91,             { BCON function code }
      #$93,             { Array left square bracket }
      #$20,             { Latin space }
      #$94,             { Array right square bracket }
      #$95,             { Arabic up arrow "pointer" }
      #$97,             { Arabic reverse apostrophe }
      #$A0,             { Arabic Space }
      #$A3,             { Arabic multiply }
      #$A6,             { Arabic period }
      #$A7,             { Arabic divide }
      #$A8,             { Arabic left parenthesis }
      #$A9,             { Arabic right parenthesis }
      #$AB,             { Arabic plus sign }
      #$AC,             { ARABIC comma }
      #$AD,             { Arabic minus }
      #$AE,             { numeric comma used as }
                        { the Latin decimal dgt }
      #$BA,             { Arabic colon }
      #$BC,             { Arabic greater than }
      #$BD,             { Arabic equal sign }
      #$BE ];           { Arabic less than }
VAR
  Debug_On    : boolean;
  Comment_On  : boolean;
  Tok_Type    : Types_Of_Token;
  Tok_Len     : Line_Range;
  Int_Str     : string[10];
  I           : integer;
  Line        : Ln_Str;
  Next_Loc    : Line_Range;
  Token       : Token_Str;
  Latin_Id    : Latin_Token;
  Hash        : HashTable;
  ArabicSpell : SpellingTable;
  Characters  : integer;
  Line_No     : integer;
  Line_Size   : Line_Range;
  Iden_No     : 000 .. 999;
  Match_Ind   : Reserved_Index;
  Latin_Char  : char;
  Lat_Int     : Str10;
  Error_Set   : SET OF Errors;
  OutFile     : text;
  InFile      : text;
  Dictionary  : text;

Procedure
OPEN_FILE;
VAR
  valid     : boolean;        { for I/O error W/ file name }
  F_Name,                     { file name with no extension }
  File_Name : string[12];     { file name from key board. }
  ind       : integer;
BEGIN
  valid := false;
  WRITELN('Input_File name: ');
  REPEAT                      { until valid file name }
    READLN(File_Name);
    ASSIGN(InFile, File_Name);
    {$I-}                     { if no error opening file }
    RESET(InFile);            { then file exist }
    {$I+}                     { if no I/O error, its valid }
    valid := (IOresult = 0);
    ClrScr;
    IF not (valid) THEN
    BEGIN
      WRITELN(' *** FAILURE TO OPEN FILE ===> ', File_Name);
      WRITELN('     Please RE_ENTER Input File name');
    END;
  UNTIL valid;

  ind := 1;
  REPEAT                      { get the name W/O extension }
    F_Name(.ind.) := File_Name(.ind.);
    ind := ind + 1;
  UNTIL (File_Name(.ind.) = ' ') OR
        (File_Name(.ind.) = '.') OR
        (ind > LENGTH(File_Name));
  F_Name(.0.) := CHR(ind-1);

  ASSIGN(outfile, F_Name + '.pas');       { translator output }
  ASSIGN(dictionary, F_Name + '.dic');    { dictionary File }
  RESET(infile);
  REWRITE(outfile);
  REWRITE(dictionary);        { file contains identifiers }
END;                          { and their translations }
Procedure
INITIALIZE;                   { Initialize the hash keys }
VAR                           { and the global variables }
  KeyNo : integer;
BEGIN
  Debug_On   := false;
  Comment_On := false;
  Error_Set  := [ ];
  Line_No    := 0;
  Iden_No    := 0;
  KeyNo      := 1;
  WHILE KeyNo <= MaxKey DO    { clear every entry, including the last }
  BEGIN
    hash(.KeyNo.) := nil;
    KeyNo := KeyNo + 1;
  END;
  characters := 0;            { count of chars in spell tbl }
END;                          { initially }

PROCEDURE
FILL_BUFFER
  ( VAR line    : ln_Str;         { input line buffer }
    VAR where   : line_range;     { location in buffer }
    VAR line_no : integer;
    VAR Ln_Size : line_range
  );
BEGIN
  READLN(infile, line);
  Line_No := Line_No + 1;
  IF Debug_On THEN WRITELN(line);

  IF (line = '{[Arabic directive, illegible]}') THEN    { set comment directive }
  BEGIN
    Comment_On := true;
    READLN(infile, line);
    Line_No := Line_No + 1;
  END;

  IF line = '{[Arabic directive, illegible]}' THEN
  BEGIN                           { reset comment directive }
    Comment_On := false;
    READLN(infile, line);
    Line_No := Line_No + 1;
  END;

  IF line = '{[Arabic directive, illegible]}' THEN      { set debug directive }
  BEGIN
    Debug_On := true;
    READLN(infile, line);
    Line_No := Line_No + 1;
  END;

  IF line = '{[Arabic directive, illegible]}' THEN      { reset debug directive }
  BEGIN
    Debug_On := false;
    READLN(infile, line);
    Line_No := Line_No + 1;
  END;

  where   := 1;                   { initialize line pointer }
  Ln_Size := length(line);        { line size }
END;
FUNCTION
BUFFER_EMPTY
  ( Next_Loc : line_range;
    Ln_Size  : line_range
  ) : BOOLEAN;
                                  { check if buffer is empty }
BEGIN
  BUFFER_EMPTY := (next_loc > Ln_Size);
END;


FUNCTION
EMPTY_ERROR_SET : BOOLEAN;
  { If error set is empty then no errors are found
    yet. Translation will continue. }
BEGIN
  EMPTY_ERROR_SET := (ERROR_SET = []);
END;


Procedure
TOKEN_AND_TYPE
  ( VAR where     : line_range;       { location of next token }
    VAR token     : token_Str;
    VAR Tok_Len   : line_range;       { length of resulted token }
    VAR Tok_Type  : Types_Of_Token;   { Token type }
    VAR Match_Ind : reserved_index    { index of res. words }
  );

{ ************************************************************
  module name   : TOKEN_AND_TYPE
  date created  : April 7, 1986
  calls         : Blanks, Comments, Literal_String,
                  Integer_Tok, Identifier_Tok,
                  Reserved_Tok, Special_Char,
                  Control_Char
  called by     : MAIN
  variables     :
  last change   : Aug 5, 1986
  Comment       :
    The procedure collects the tokens and assigns
    Token Type names to them.
  ************************************************************ }

VAR   index  : integer;           { For token indexing }
      ch     : char;              { special characters token }
CONST digits : SET OF char = [#$B0 .. #$B9];
Procedure
BLANK;                            { collects blank(s) token }
VAR index : integer;
BEGIN
  index := 0;
                                  { A0 Arabic space 'blanks' }
                                  { 20 Latin space 'blanks' }
  WHILE (ORD(line[where]) = $A0) OR
        (ORD(line[where]) = $20) DO
  BEGIN
    index := index + 1;
    token(.index.) := line(.where.);
    where := where + 1;
  END;
  Tok_Type := blanks;
  Tok_Len  := index;
  token(.0.) := CHR(index);
END;
Procedure
COMMENT;
{ Procedure comment will assign the matching Latin
  brackets and the body of the comment to the token.
  The token type is then set to Comment. }
BEGIN
  token[1] := '(';                { assign the opening bracket }
  token[2] := '*';                { and asterisk to token }
  index := 2;                     { start of comment body }
  where := where + 2;
  REPEAT                          { assign body of comment }
    index := index + 1;           { pointer of token buffer }
    token[index] := line[where];
    where := where + 1;           { pointer of line buffer }
  UNTIL ((ORD(line[where]) = $AA) AND
         (ORD(line[where+1]) = $A8)) OR
        (where >= Line_Size);

  IF (where >= Line_Size) THEN
  BEGIN                           { The end of line is reached }
    Tok_Type  := Illegal;         { before closing the comment }
    Error_Set := Error_Set + [Long_Comment];
  END
  ELSE                            { the comment is valid }
  BEGIN
    token[index+1] := '*';        { assign the closing bracket }
    token[index+2] := ')';
    Tok_Type := coment;
    where    := where + 2;        { advance line pointer }
    Tok_Len  := index + 2;        { advance token pointer }
    token[0] := chr(Tok_Len);     { set token length }
  END;
END;   { COMMENT }
PROCEDURE
LITERAL_STRING;
{ ************************************************************ }
{ Literal string will look for single and double quotes,
  matching the quote character at the beginning and the
  end of the string, then assigning the Latin quotation
  marks. }
{ ************************************************************ }
BEGIN
  index := 0;
  CASE ORD(line[where]) OF        { if buffer points at : }
    $97 : REPEAT                  { single quotes }
            index := index + 1;
            token[index] := line[where];
            where := where + 1;
          UNTIL (ORD(line[where]) = $97) OR
                (where > line_size);

    $A2 : REPEAT                  { double quotes }
            index := index + 1;
            token[index] := line[where];
            where := where + 1;
          UNTIL (ORD(line[where]) = $A2) OR
                (where > line_size);
  END;   { CASE }

                                  { if literal ended with }
                                  { the right quote mark: }
  IF (ORD(line(.where.)) = $A2) OR
     (ORD(line(.where.)) = $97) THEN
  BEGIN
    index    := index + 1;        { advance pointer for the }
    Tok_Len  := index;            { quote mark. Set length. }
    Tok_Type := Literal_Str;
    token[0] := chr(Tok_Len);
                                  { for single quote literal }
    IF (ORD(token[1]) = $97) THEN
    BEGIN                         { assign single quotes }
      token[1]     := chr($27);
      token[index] := chr($27);
    END;

    IF (ORD(line[where]) = $A2) THEN
    BEGIN                         { assign double quotes }
      token[1]     := chr($22);
      token[index] := chr($22);
    END;

    where := where + 1;           { point to the next token }
  END
  ELSE                            { if line pointer did not see }
  BEGIN                           { single/double quote: error }
    Error_Set := Error_Set + [Long_Literal_Str];
    Tok_Type  := illegal;
    Tok_Len   := index;
    token[0]  := chr(index);      { set length of token }
  END;
END;
PROCEDURE
INTEGER_TOK;
{ ************************************************************ }
{ The procedure will return the digits ranging
  from B0 .. B9 Hex. }
BEGIN
  index := 0;
  WHILE (line(.where.) in digits) DO
  BEGIN
    index := index + 1;
    token(.index.) := line(.where.);
    where := where + 1;
  END;
  Tok_Type := integer1;
  Tok_Len  := index;
  token[0] := chr(index);
END;
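
The digit codes collected by this procedure lie in the contiguous range B0..B9 hex, so each one can be mapped to its Latin digit by a fixed offset. The sketch below illustrates that idea under this assumption only. LatinDigit is a hypothetical name, and the thesis's own LATIN_INT procedure performs the mapping with a CASE statement, so its details may differ.

```pascal
program DigitMap;

{ Hypothetical helper: maps a digit code in $B0..$B9 to '0'..'9'.
  Any other character is returned unchanged. }
function LatinDigit(c: char): char;
begin
  if (ord(c) >= $B0) and (ord(c) <= $B9) then
    LatinDigit := chr(ord(c) - $B0 + ord('0'))
  else
    LatinDigit := c;
end;

var
  Token, Latin: string[10];
  i: integer;
begin
  { an invented integer token holding the digit codes for 1 9 8 6 }
  Token := chr($B1) + chr($B9) + chr($B8) + chr($B6);
  Latin := '';
  for i := 1 to length(Token) do
    Latin := Latin + LatinDigit(Token[i]);
  writeln(Latin);   { prints 1986 }
end.
```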

Procedure
IDENTIFIER_TOK;
{ The procedure will look for any number of digits and
  underscore characters following the first letter. }
VAR valid : boolean;
BEGIN
  index := 0;
  REPEAT
    index := index + 1;
    token(.index.) := line(.where.);
    where := where + 1;
  UNTIL not (ORD(line(.where.)) in Arabic_alph);
  Tok_Type := Identifier;
  Tok_Len  := index;
  token[0] := chr(index);
END;
Procedure
RESERVED_TOK
  ( VAR match_index : reserved_index );
{ If the TOKEN is a reserved word, the procedure will
  set the token type to Reserved_Word and pass the
  index of the word in the constant array. }
VAR   index : integer;
      hit   : boolean;            { when a match is found }
BEGIN
  hit   := false;
  index := 1;
  WHILE (index <= res_words) AND (not(hit)) DO
  BEGIN
    IF (token = res_word(.index.).Arabic) THEN
    BEGIN                         { the token match with }
      hit := true;                { reserved word }
      match_index := index;
    END;
    index := index + 1;
  END;   { while no hit }

  IF hit THEN                     { if token is reserved word }
    Tok_Type := Reserved_word;    { set the token type }
END;
Procedure
SPECIAL_CHAR_TOK;
{ ************************************************************ }
{ The procedure gets all the tokens of one char
  other than the escape codes. }
VAR   Illegal_Chars : set of $21..$FF;
BEGIN
  Illegal_Chars := [ $21 .. $7E,      { Latin chars }
                     $81 .. $8D,      { numeric characters, Arabic }
                     $92,             { Arabic & character }
                     $97, $99,
                     $9B .. $9F,      { non used characters }
                     $A1 .. $A2,
                     $A4 .. $A5,
                     $AA, $AF,
                     $BF,
                     $C0 .. $CF ];    { Arabic diacritics }
  IF ord(line(.where.)) in Illegal_Chars THEN
  BEGIN                               { Latin characters }
    Tok_Type  := illegal;
    error_set := error_set + [Illegal_Char];
  END
  ELSE
  BEGIN
    token[1]   := line[where];        { one character special char }
    where      := where + 1;          { advance line pointer }
    Tok_Len    := 1;
    token(.0.) := chr(1);             { set token length to one }
    Tok_Type   := Funct_Operator      { set token type }
  END;
END;

Procedure
CONTROL_CHARS;
{ ************************************************************ }
{ Control characters are used by BCON and will be omitted. }
{ ************************************************************ }
BEGIN
  token[1] := line[where];
  Tok_Type := contrl_cod;
  Tok_Len  := 1;
  IF Debug_On THEN
  BEGIN
    WRITELN(' Control character (', ORD(line[where]),
            ') in source code');
    WRITELN(' IN Line Number ', Line_No,
            ' Location = ', where);
  END;
  where := where + 1;
END;

BEGIN   (* TOKEN_AND_TYPE *)
  { Based on the first character of the token call an
    appropriate module to collect the token and set the type. }

  Tok_Type := unclsfd;                       { initialize token type }
  IF (ORD(line[where]) = $A9) AND            { $A9 opening bracket }
     (ORD(line[where+1]) = $AA) THEN         { $AA is asterisk }
    COMMENT;                                 { call procedure Comment }

  IF Tok_Type <> coment THEN                 { if not comment THEN based }
    CASE ORD(line[where]) OF                 { on first char get the type }
      $A0, $20 : BLANK;                      { leading space(s) }
      $A2, $97 : LITERAL_STRING;
      $B0..$B9 : INTEGER_TOK;                { get integer token }
      $D0..$FD : BEGIN                       { leading letter }
                   IDENTIFIER_TOK;           { is it user defined/reserved }
                   RESERVED_TOK(match_ind);
                 END;
      $80, $8E, $8F, $90, $91 :
                 CONTROL_CHARS;              { control characters }
      ELSE       SPECIAL_CHAR_TOK;
    END;   { case }
END;
Procedure
MAP_IDEN_TO_LATIN
  (     token    : Token_Str;
        lenth    : integer;
    VAR Latin_Id : Latin_Token );

{ ************************************************************
  module name   : Map_Iden_To_Latin
  date created  : April 30, 1986
  calls         : SEARCH
  called by     : MAIN
  variables     :
    token    = scanned identifier token.
    lenth    = length of scanned identifier.
    Latin_Id = the translated identifier in Latin form.
  last change   : Aug 2, 1986
  Comment       :
    The procedure will look up an Arabic identifier. If it is not
    in the list, it will insert the Arabic token in the list.
    The token will be assigned a Latin label for the use of
    the PASCAL compiler. The meaningless label will have the
    form Id_###, where each '#' is a digit.

    Note: code segments of this module are taken from
          "BRINCH HANSEN ON PASCAL COMPILERS" 1985,
          see thesis references.
  ************************************************************ }
Function Hash_Key                 { return the hash key of }
  ( token : token_Str;            { the identifiers }
    lenth : line_range
  ) : integer;

CONST W = 32513;                  { 32768 - 255, overflow check }
      N = MaxKey;                 { Prime number for words size }
VAR   sum, i : integer;           { sum is the token ord. value }
BEGIN
  sum := 0;
  i   := 1;
  WHILE i <= lenth DO             { include the last character }
  BEGIN
    sum := (sum + ORD(token(.i.))) MOD W;
    i   := i + 1;
  END;
  Hash_Key := (sum MOD N) + 1;
END;
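
As a worked example of the additive hashing used by Hash_Key, the sketch below sums the character codes of a token modulo W and reduces the sum to a key in 1..MaxKey. It is a standalone illustration rather than the thesis code: the three-byte token is invented, the names HashKey and TokenStr are hypothetical, and the loop visits every character of the token.

```pascal
program HashDemo;

const
  MaxKey = 6310;     { table size as given in the listing }
  W      = 32513;    { 32768 - 255, keeps the running sum within 16 bits }

type
  TokenStr = string[255];

{ Sums the character codes modulo W, then maps the sum to 1..MaxKey. }
function HashKey(Token: TokenStr; Len: integer): integer;
var
  Sum, i: integer;
begin
  Sum := 0;
  for i := 1 to Len do
    Sum := (Sum + ord(Token[i])) mod W;
  HashKey := (Sum mod MaxKey) + 1;
end;

var
  Sample: TokenStr;
begin
  Sample := chr($D0) + chr($D1) + chr($D2);   { invented 3-letter token }
  { 208 + 209 + 210 = 627, so the key is 627 mod 6310 + 1 = 628 }
  writeln(HashKey(Sample, length(Sample)));
end.
```

Because the sum is reduced modulo W after every addition, it never exceeds 32512 + 255 = 32767, which is the largest value a 16-bit integer can hold.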


Procedure INSERT
  ( token : token_Str;
    lenth : line_range;
    index : integer;
    KeyNo : integer
  );

VAR   m, n    : integer;
      pointer : wordpointer;
      temp    : Latin_token;

  PROCEDURE
  ID_NO(VAR Latin_id : Latin_token);
  VAR
    TEMP : string[3];
  BEGIN
    CASE IDEN_NO OF
      0..9     : BEGIN
                   STR(Iden_No:1, TEMP);
                   Latin_id := CONCAT('id_', TEMP);
                 END;
      10..99   : BEGIN
                   STR(Iden_No:2, TEMP);
                   Latin_id := CONCAT('id_', TEMP);
                 END;
      100..999 : BEGIN
                   STR(Iden_No:3, TEMP);
                   Latin_id := CONCAT('id_', TEMP);
                 END;
    END;
  END;
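
The three CASE branches in ID_NO differ only in the field width passed to STR. A minimal sketch of an equivalent single call follows, assuming that Str without a width yields the digits with no padding, as it does in Turbo Pascal and Free Pascal. MakeId and LatinToken are hypothetical names used only for this illustration.

```pascal
program IdLabels;

type
  LatinToken = string[6];   { 'id_' plus up to three digits }

{ Hypothetical helper: builds the label id_N for 0 <= N <= 999. }
function MakeId(N: integer): LatinToken;
var
  Digits: string[3];
begin
  Str(N, Digits);
  MakeId := 'id_' + Digits;
end;

begin
  writeln(MakeId(7));     { id_7 }
  writeln(MakeId(42));    { id_42 }
  writeln(MakeId(999));   { id_999 }
end.
```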
BEGIN
  characters := characters + lenth;   { insert Identifier in }
                                      { spelling table }
  m := lenth;
  n := characters - m;
  WHILE (m > 0) DO
  BEGIN
    ArabicSpell(.m+n.) := token(.m.);
    m := m - 1;
  END;

  ID_NO(temp);
  NEW(pointer);                       { Insert word record info }
  pointer^.Latin_Id := temp;
  pointer^.NextWord := Hash(.KeyNo.);
  pointer^.Index    := index;
  pointer^.lenth    := lenth;
  pointer^.lastchar := characters;
  WRITELN(dictionary, ' ', pointer^.Latin_Id, ' ', token);
  Hash(.KeyNo.)     := pointer;
END;


FUNCTION
FOUND
  ( token   : token_Str;
    lenth   : integer;
    pointer : WordPointer
  ) : boolean;

VAR   same : boolean;
      m, n : integer;
BEGIN
  IF pointer^.lenth <> lenth THEN
    same := false
  ELSE
  BEGIN
    same := true;
    m := lenth;
    n := pointer^.lastchar - m;
    WHILE same AND (m > 0) DO
    BEGIN
      same := token(.m.) = ArabicSpell(.m+n.);
      m := m - 1;
    END;
  END;
  FOUND := same;
END;
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Procedure 
Search 


( 


VAR 


); 


token 
lenth 
Lati  n 


Id 


token_Str ; 

integer;   C  token  length  > 

Latin  token    C  returned  Latin  tokc?n> 


C   Comment: 

The  modul 
token  key  and 
The  hash  ta 
word  records., 
location  in 
the  ne;;t  vsiord 
IF  the  fj o 
that  means  th 
word  must  be 
table.  Insert 
pointer  is  po 
records,  tunc 
FOUND  IS 


■&■»•*•*■**  ****-^**-*********-^-*f-^*-****-<^**  *****•*•« 


e  will  c  a 1 
then  look 
ble  con ten 
The  recor 
symbol  tab 
i.  n  the  1  i 
inter  resu 
e  word  is 
inserted  i 
ion  is  mad 
inting  at 
tion 
call ed  to 


1  -function  Hash_Key  to  get  the 

the  key  up  in  a  hash  table, 
t  is  pointers,  pointing  at 
ds  has  the  length  o-f  token  , 
le,  Latin   Identifier  number, 
nked  1  i  st.. 

Ited  ■from  the  Key  number  is  nil, 
not  in  the  table.  That  means  the 
f  there  is  room  in  the  spelling 
e  by  procedure  INSERT™  If  the 
a  record,  or  linked  list  of 

verify  the  spelling. 


**** -JS- *■«■ -s- -f^  ** -K- -iit -if  •*(- -X- *■****•* -je  * -s- * -Jt -x- * 


VAR   Key No 
done 
Pointer 


integer;   C  global  variables  for 
bool earn ; 
wordpoi  nter ; 


search: 


BEGIN  { SEARCH }
   KeyNo   := Hash_Key(token, lenth);
   pointer := hash(.KeyNo.);
   done    := false;
   WHILE not(done) DO
                                    { insert new id. if size and }
      IF (pointer = nil) THEN       { number within limits }
      BEGIN  { add identifier }
         Iden_No  := Iden_No + 1;
         INSERT (token, lenth, Iden_No, KeyNo);
         Latin_Id := hash(.KeyNo.)^.Latin_Id;
         done     := true;
      END
      ELSE IF FOUND (token, Tok_Len, pointer) THEN
      BEGIN
         Latin_Id := pointer^.Latin_Id;
         done     := true;
      END
      ELSE
         pointer := pointer^.nextword
END;
BEGIN  { Map_Iden_To_Latin }
   SEARCH (Token, Tok_Len, Latin_Id);
END;   { MAP-IDENTIFIER-TO-LATIN }


PROCEDURE GET_LATIN_SPEC_CHAR (     token      : token_Str;
                                VAR Latin_char : char );

VAR  Arb_Char : string[1];

BEGIN
   Arb_Char := token(.1.);
   CASE ORD(Arb_Char) OF
      $BC : Latin_char := '>';    { Arabic greater than }
      $BE : Latin_char := '<';    { Arabic less than }
      $93 : Latin_char := ']';    { Arabic square bracket }
      $94 : Latin_char := '[';    {   "      "      "     }
      $A8 : Latin_char := ')';    { Arabic RIGHT parenthesis }
      $A9 : Latin_char := '(';    {   "    LEFT      "       }
      $AB : Latin_char := '+';    { Arabic Plus }
      $AD : Latin_char := '-';    { Minus }
      $A7 : Latin_char := '/';    { Divide }
      $96 : Latin_char := '_';    { Under_Score }
      $A3 : Latin_char := '*';    { Multiply }
      $BA : Latin_char := ':';    { Colon }
      $BD : Latin_char := '=';    { Equal }
      $AE : Latin_char := '.';    { Numeric comma }
      $95 : Latin_char := '^';    { Hat }
      $A6 : Latin_char := '.';    { Period }
      $BB : Latin_char := ';';    { Semicolon }
      $AC : Latin_char := ',';    { Comma }
   END;
END;

Procedure LATIN_INT (     token   : token_Str;
                          Tok_Len : line_range;
                      VAR Lat_Int : Str10 );

VAR  ind : integer;

BEGIN                                { for each digit map to }
                                     { Latin digit }
   for ind := 1 to Tok_Len DO
      CASE ORD(token(.ind.)) of
         $B0 : Lat_Int(.ind.) := '0';
         $B1 : Lat_Int(.ind.) := '1';
         $B2 : Lat_Int(.ind.) := '2';
         $B3 : Lat_Int(.ind.) := '3';
         $B4 : Lat_Int(.ind.) := '4';
         $B5 : Lat_Int(.ind.) := '5';
         $B6 : Lat_Int(.ind.) := '6';
         $B7 : Lat_Int(.ind.) := '7';
         $B8 : Lat_Int(.ind.) := '8';
         $B9 : Lat_Int(.ind.) := '9';
      END;
   Lat_Int(.0.) := token(.0.);       { set length of token }
END;

PROCEDURE PRINT_ERROR_MESSAGES;

Var  ind : integer;

BEGIN
   WRITELN ('*** ERROR ON LINE NO. ', line_no);
   for ind := 1 to line_size do write (line(.ind.));
   WRITELN;
   IF long_token IN error_set THEN
      WRITELN (' has long token *** ', token);
{  IF long_comment IN error_set THEN
      WRITELN (' has long comment *** ', token);  }
   IF long_literal_Str in error_set THEN
      WRITELN (' UNCLOSED QUOTES ');
   IF Illegal_Char IN error_set THEN
      WRITELN ('====== Character number ', Next_Loc,
               ' is out of range ======');
END;
{ main }

BEGIN
   OPEN_FILE;
   INITIALIZE;

   While not (eof(infile)) AND (error_set = []) DO
   BEGIN  { Line process }
      FILL_BUFFER (line, next_loc, line_no, line_size);
      WHILE not (BUFFER_EMPTY(next_loc, line_size)) AND
            (error_set = []) DO
      BEGIN  { Token process }
         TOKEN_AND_TYPE (next_loc, token, Tok_Len,
                         Tok_Type, Match_Ind);
         IF Debug_On THEN
            WRITELN ('token = ', token, ' length= ',
                     Tok_Len, ' Next_Loc = ', Next_Loc);

         CASE Tok_Type of
            blanks         : FOR i := 1 to Tok_Len DO
                                write (outfile, ' ');
            coment         : IF Comment_On THEN
                                write (outfile, token);
            literal_Str    : write (outfile, token);
            reserved_word  : write (OUTFILE,
                                    res_word(.match_ind.).English);
            identifier     : IF (Iden_No < 1000) AND
                                (characters < MaxChar) THEN
                             BEGIN
                                MAP_IDEN_TO_LATIN (token, Tok_Len, Latin_Id);
                                write (outfile, Latin_Id);
                             END;
            integer1       : BEGIN
                                LATIN_INT (token, Tok_Len, Lat_Int);
                                write (outfile, Lat_Int);
                             END;
            funct_operator : BEGIN
                                GET_LATIN_SPEC_CHAR (token, Latin_char);
                                write (outfile, Latin_char);
                             END;
            contrl_cod     : WRITELN (' line ', line_no:4,
                                      ' Control code was ignored ');
            illegal        : BEGIN
                                PRINT_ERROR_MESSAGES;
                             END;
         END;  { CASE }
      END;     { WHILE TOKEN }
      WRITELN (outfile);
   END;

   IF error_set <> [] THEN WRITELN ('error on token type');
   CLOSE (outfile);
   CLOSE (infile);
   CLOSE (dictionary)
END.
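
The lookup performed by Search in the listing above can be condensed into a short, self-contained sketch. It is an illustration under stated assumptions, not the program's actual code: the table size, the hash function, and the choice to store each spelling inside its record are hypothetical simplifications, whereas the listing keeps spellings in a separate spelling table and delegates the work to Hash_Key, INSERT and FOUND.

```pascal
program HashSketch;
{ Hypothetical, simplified stand-in for the Search / INSERT / FOUND trio. }

const
  TableSize = 31;                  { assumed size; not taken from the listing }

type
  WordPointer = ^WordRecord;
  WordRecord  = record
    Spelling : string[32];         { assumption: spelling kept in the record }
    Latin_Id : integer;            { number n of the generated name id_n }
    NextWord : WordPointer;        { next record in the same chain }
  end;

var
  Hash    : array [0..TableSize - 1] of WordPointer;
  Iden_No : integer;
  k       : integer;

{ Assumed hash: sum of the character codes modulo the table size. }
function Hash_Key (const token : string) : integer;
var
  j, sum : integer;
begin
  sum := 0;
  for j := 1 to length(token) do
    sum := sum + ord(token[j]);
  Hash_Key := sum mod TableSize;
end;

{ Return the identifier number for token, inserting it if it is new. }
function Search (const token : string) : integer;
var
  KeyNo : integer;
  p     : WordPointer;
begin
  KeyNo := Hash_Key(token);
  p := Hash[KeyNo];
  { walk the chain until the spelling matches or the chain ends;
    relies on short-circuit evaluation of "and" }
  while (p <> nil) and (p^.Spelling <> token) do
    p := p^.NextWord;
  if p = nil then
  begin
    Iden_No := Iden_No + 1;
    new(p);
    p^.Spelling := token;
    p^.Latin_Id := Iden_No;
    p^.NextWord := Hash[KeyNo];    { push onto the front of the chain }
    Hash[KeyNo] := p;
  end;
  Search := p^.Latin_Id;
end;

begin
  for k := 0 to TableSize - 1 do
    Hash[k] := nil;
  Iden_No := 0;
  writeln('id_', Search('alpha'));
  writeln('id_', Search('beta'));
  writeln('id_', Search('alpha'));
end.
```

Compiled with a current PASCAL compiler such as Free Pascal, the sketch prints id_1, id_2 and id_1: a spelling met for the first time receives the next identifier number, and a repeated spelling returns the number it was given before. The same behaviour is visible in the test runs of Appendix I.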
APPENDIX  I
TEST  RUNS 


Test  Run  1


[Arabic PASCAL source listing; illegible in this copy]


Source  Code
id_1    [Arabic identifier]
id_2    [Arabic identifier]
id_3    [Arabic identifier]
id_4    [Arabic identifier]
id_5    [Arabic identifier]
id_6    [Arabic identifier]
id_7    [Arabic identifier]
id_8    [Arabic identifier]


Test  Run  1  Dictionary  Table


program id_1;

const    id_2  = 32 ;
type     id_3  = record
            id_4  : string [30] ;
            id_5  : integer ;
            id_6  : boolean ;
         end ;
var      id_7  : array [ 1..id_2] of id_3 ;
         id_8  : integer;
begin
   while ( id_8 < 32) do
   begin
      id_8 := id_8 + 1 ;
      read  ( id_7[id_8].id_4, id_7[id_8].id_5) ;
      write ( id_7[id_8].id_4, id_7[id_8].id_5) ;
   end;
end .


Generated  Code
Test  Run  2


[Arabic PASCAL source listing; illegible in this copy]


Source  Code


id_1    [Arabic identifier]
id_3    [Arabic identifier]
id_5    [Arabic identifier]
id_6    [Arabic identifier]
id_7    [Arabic identifier]


Dictionary  Table


program id_1;

const    id_2  = 32 ;
type     id_3  = record
            id_4  : string [30] ;
            id_5  : integer ;
            id_6  : boolean ;
         end ;
var      id_7  : array [ 1..id_2] of id_3 ;
         id_8  : integer;
begin
   while ( id_8 < 32) do
   begin
      id_8 := id_8 + 1 ;
      read  ( id_7[id_8].id_4, id_7[id_8].id_5) ;
      write ( id_7[id_8].id_4, id_7[id_8].id_5) ;
   end ;
end .


Generated  Code
Test  Run  3


[Arabic PASCAL source listing; illegible in this copy]


Source  Code


program id_1 (input, output) ;
const  id_2 = 15;
var    id_3 : string[50];
       id_4 : string[12];
       id_5 : real ;

begin
   id_5 := 122.7 * id_4 ;
   id_3 := '[Arabic string literal]' ;
   id_4 := '[Arabic string literal]' ;

   concat  (id_3, id_4);
   writeln (id_3, id_4);
end .


Generated  Code


id_1    [Arabic identifier]
id_2    [Arabic identifier]
id_3    [Arabic identifier]
id_4    [Arabic identifier]
id_5    [Arabic identifier]


Dictionary  Table